You digitized it, but now what? Exploring computational methods for extracting biodiversity data from historical collections
- Climate change is driving rapid changes in our biosphere on local and global scales. Our capacity to understand these shifts relies entirely upon two critical things: long-term observations, and an ability to discover and access them. Species occurrence data, which record the presence of a species at a particular place on a specified date, are foundational to understanding biodiversity and tracking changes due to the effects of climate change. Open knowledge bases that gather species occurrence records enable researchers to assess spatio-temporal changes in biodiversity, but historical observations recorded on paper are often missing from them. Libraries at several academic marine research stations on the West Coast of North America hold large physical collections of undergraduate student reports. These reports include field observations of species occurrences and populations recorded over a span of nine decades. Each library collection is important within its local context, but taken collectively these papers represent an extremely valuable corpus for conducting biodiversity research. Even after digitization, however, observational data in these papers are still “hidden” in the text. Reading and extracting those data by hand is an effort we cannot realistically undertake. In this presentation, I will describe a collaborative project in which we explore the potential of natural language processing, machine learning, and data visualization to identify and verify species occurrences in unpublished student research papers. I will review how we approach identifying relevant entities in the texts, link them to taxonomic authorities, and create derivative datasets. The final goal of the project is to serve the species occurrence metadata to relevant aggregators, e.g., the Global Biodiversity Information Facility.
The overarching message of this talk will be how we can take advantage of computational methods to amplify the work of information professionals in surfacing historical biodiversity data.
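The three steps named in the abstract (identifying entities in the text, linking them to a taxonomic authority, and creating a derivative dataset) can be illustrated with a minimal Python sketch. This is not the project's actual pipeline, which the abstract does not describe in detail. The regular expression, the `AUTHORITY` lookup table and its identifiers, the sample sentence, and the sample date and locality are all hypothetical and serve only as illustration. The record field names follow Darwin Core terms, on the assumption that the output is meant for an aggregator that accepts that vocabulary.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a taxonomic authority.
# The identifiers are invented for illustration only.
AUTHORITY = {
    "pisaster ochraceus": "EXAMPLE:0001",
    "mytilus californianus": "EXAMPLE:0002",
}

# Naive candidate pattern for a binomial name: a capitalized word
# followed by a lowercase word of at least three letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


@dataclass
class Occurrence:
    # Field names borrowed from Darwin Core terms.
    scientificName: str
    taxonID: str
    eventDate: str
    locality: str


def extract_occurrences(text, event_date, locality):
    """Return one record per distinct candidate name found in the authority."""
    records = []
    seen = set()
    for match in BINOMIAL.finditer(text):
        name = f"{match.group(1)} {match.group(2)}"
        taxon_id = AUTHORITY.get(name.lower())
        # Skip candidates the authority does not recognize, and repeats.
        if taxon_id is None or name in seen:
            continue
        seen.add(name)
        records.append(Occurrence(name, taxon_id, event_date, locality))
    return records


# Invented sample text, date, and locality.
sample = (
    "We counted Pisaster ochraceus and Mytilus californianus "
    "along the transect. The tide was low."
)
records = extract_occurrences(sample, "1975-05-12", "example intertidal site")
for record in records:
    print(record)
```

The pattern deliberately over-matches (it also picks up ordinary phrases such as "We counted"), and the authority lookup acts as the verification step that discards those false candidates. A real workflow would likely replace the regular expression with a trained entity recognizer and the dictionary with queries to an actual taxonomic service.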
|Type of resource
|February 15, 2022; December 5, 2022
|February 9, 2022; February 9, 2022
|Text data mining
|Natural language processing
- Use and reproduction
- User agrees that, where applicable, content will not be used to identify or to otherwise infringe the privacy or confidentiality rights of individuals. Content distributed via the Stanford Digital Repository may be subject to additional license and use restrictions applied by the depositor.
- This work is licensed under a Creative Commons Attribution 4.0 International license (CC BY).
- Preferred citation
- Whitmire, A. (2022). You digitized it, but now what? Exploring computational methods for extracting biodiversity data from historical collections. Stanford Digital Repository. Presented at the International Ocean Data Conference 2022. Available at https://purl.stanford.edu/mp178ym9045
Stanford Libraries staff presentations, publications, and research