A needle in a haystack: Using a large language model to aid in the identification of species occurrences in unpublished student research reports

Whitmire, Amanda

doi:10.25740/jv039vn8249

Abstract/Contents

Abstract: Our coastal marine station, along with many others, holds a valuable collection of student research reports spanning several decades. These reports contain historical observations that could play a crucial role in identifying changes in biodiversity and revealing the impacts of climate change. However, a major obstacle to using these observations of marine plants and animals is that they are buried within large amounts of text, making manual analysis impractical. In recent years, a dedicated team of Stanford Libraries staff has been experimenting with various text-mining approaches on our collection of student reports from Hopkins Marine Station. Our objective has been to identify species occurrences (species + place + date) and verify the accuracy of these observations. While we have achieved some small successes along the way, progress has been hindered by limited staff resources for developing a trained language model and a user interface that could efficiently process more than 700 papers. Nonetheless, our hopes were reignited by recent advancements in pre-trained large language models (LLMs) such as GPT-4 and LLaMA. In this presentation, I will share our journey of exploring the potential of applying an LLM approach to our corpus. I will describe the development of prompts to extract taxonomic names, places, habitats, and dates from the corpus, and provide an overview of the workflow, from compiling collection metadata to extracting species occurrences.

Description

Type of resource	text
Date created	October 25, 2023
Date modified	November 6, 2023
Publication date	October 26, 2023

Creators/Contributors

Author	Whitmire, Amanda	https://orcid.org/0000-0003-2429-8879 (unverified)

Subjects

Subject	Biodiversity
Subject	Marine science libraries
Subject	Text data mining
Subject	Large language model
Genre	Text
Genre	Presentation slides

Bibliographic information

DOI	https://doi.org/10.25740/jv039vn8249
Location	https://purl.stanford.edu/jv039vn8249

Access conditions

Use and reproduction: User agrees that, where applicable, content will not be used to identify or to otherwise infringe the privacy or confidentiality rights of individuals. Content distributed via the Stanford Digital Repository may be subject to additional license and use restrictions applied by the depositor.
License: This work is licensed under a Creative Commons Attribution 4.0 International license (CC BY).

Preferred citation

Preferred citation: Whitmire, Amanda (2023). A needle in a haystack: Using a large language model to aid in the identification of species occurrences in unpublished student research reports. [oral presentation]. 49th Annual Conference of the International Association of Aquatic and Marine Science Libraries and Information Centers (IAMSLIC), British Columbia, Canada. Stanford Digital Repository. Available at https://purl.stanford.edu/jv039vn8249. https://doi.org/10.25740/jv039vn8249.

Collection

Stanford Libraries staff presentations, publications, and research

Contact information

Contact: thalassa@stanford.edu
