A needle in a haystack: Using a large language model to aid in the identification of species occurrences in unpublished student research reports


Abstract/Contents

Abstract
Our coastal marine station, along with many others, holds a valuable collection of student research reports spanning several decades. These reports contain historical observations that could play a crucial role in identifying changes in biodiversity and revealing the impacts of climate change. However, a major obstacle to using these observations of marine plants and animals is that they are buried within vast amounts of text, making manual analysis impractical. In recent years, a dedicated team of Stanford Libraries staff has been experimenting with various text-mining approaches on our collection of student reports from Hopkins Marine Station. Our objective has been to identify species occurrences (species + place + date) and verify the accuracy of these observations. While we have achieved some small successes along the way, progress has been hindered by limited staff resources for developing a trained language model and a user interface to efficiently process over 700 papers. Nonetheless, our hopes were reignited by recent advancements in pre-trained large language models (LLMs) such as GPT-4 and LLaMA. In this presentation, I will share our journey of exploring the potential of applying an LLM approach to our corpus. I will delve into the development of prompts to extract taxonomic names, places, habitats, and dates from the corpus, and provide an overview of the workflow that spans from compiling collection metadata to the extraction of species occurrences.

Description

Type of resource: text
Date created: October 25, 2023
Date modified: November 6, 2023
Publication date: October 26, 2023

Creators/Contributors

Author: Whitmire, Amanda (ORCID: https://orcid.org/0000-0003-2429-8879, unverified)

Subjects

Subject: Biodiversity
Subject: Marine science libraries
Subject: Text data mining
Subject: Large language model
Genre: Text
Genre: Presentation slides

Bibliographic information

Access conditions

Use and reproduction
User agrees that, where applicable, content will not be used to identify or to otherwise infringe the privacy or confidentiality rights of individuals. Content distributed via the Stanford Digital Repository may be subject to additional license and use restrictions applied by the depositor.
License
This work is licensed under a Creative Commons Attribution 4.0 International license (CC BY).

Preferred citation

Whitmire, Amanda (2023). A needle in a haystack: Using a large language model to aid in the identification of species occurrences in unpublished student research reports. [oral presentation]. 49th Annual Conference of the International Association of Aquatic and Marine Science Libraries and Information Centers (IAMSLIC), British Columbia, Canada. Stanford Digital Repository. Available at https://purl.stanford.edu/jv039vn8249. https://doi.org/10.25740/jv039vn8249.

Collection

Stanford Libraries staff presentations, publications, and research

