A needle in a haystack: Using a large language model to aid in the identification of species occurrences in unpublished student research reports
Abstract/Contents
- Abstract
- Our coastal marine station, along with many others, holds a valuable collection of student research reports spanning several decades. These reports contain historical observations that could play a crucial role in identifying changes in biodiversity and revealing the impacts of climate change. However, a major obstacle to using these observations of marine plants and animals is that they are buried within vast amounts of text, making manual analysis impractical. In recent years, a dedicated team of Stanford Libraries staff has been experimenting with various text-mining approaches on our collection of student reports from Hopkins Marine Station. Our objective has been to identify species occurrences (species + place + date) and verify the accuracy of these observations. While we have achieved some small successes along the way, progress has been hindered by limited staff resources for developing a trained language model and a user interface that can efficiently process over 700 papers. Nonetheless, our hopes were reignited by recent advancements in pre-trained large language models (LLMs) such as GPT-4 and LLaMA. In this presentation, I will share our journey of exploring the potential of applying an LLM approach to our corpus. I will delve into the development of prompts to extract taxonomic names, places, habitats, and dates from the corpus, and provide an overview of the workflow that spans from compiling collection metadata to the extraction of species occurrences.
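The abstract does not give the prompts or the output format, so the following is only a minimal sketch of what prompt-based extraction of taxonomic names, places, habitats, and dates could look like. The field names, the prompt wording, the JSON output format, and the `Occurrence`, `build_prompt`, and `parse_response` names are all assumptions made for illustration, not the author's method. The call to an actual LLM is deliberately left out.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical field names; the presentation does not specify an output schema.
FIELDS = ("taxon", "place", "habitat", "date")

# Hypothetical prompt wording, not the prompt used in the presentation.
PROMPT_TEMPLATE = (
    "Extract every species occurrence from the report excerpt below.\n"
    "Return a JSON list of objects with the keys: {keys}.\n"
    "Use null for any value the text does not state.\n\n"
    "Excerpt:\n{excerpt}\n"
)


@dataclass
class Occurrence:
    taxon: Optional[str]
    place: Optional[str]
    habitat: Optional[str]
    date: Optional[str]

    def is_complete(self) -> bool:
        """A usable occurrence needs species + place + date."""
        return all((self.taxon, self.place, self.date))


def build_prompt(excerpt: str) -> str:
    """Fill the prompt template with one excerpt from a report."""
    return PROMPT_TEMPLATE.format(keys=", ".join(FIELDS), excerpt=excerpt.strip())


def parse_response(raw: str) -> List[Occurrence]:
    """Turn the model's JSON reply into records; missing keys become None."""
    records = json.loads(raw)
    return [Occurrence(**{k: rec.get(k) for k in FIELDS}) for rec in records]
```

In use, `build_prompt` would be applied to each passage of a report, the resulting text sent to whichever model is available, and the reply passed to `parse_response`. Records for which `is_complete()` is false could then be set aside for manual review, since the abstract stresses verifying the accuracy of each observation.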
Description
Type of resource | text |
---|---|
Date created | October 25, 2023 |
Date modified | November 6, 2023 |
Publication date | October 26, 2023 |
Creators/Contributors
Author | Whitmire, Amanda |
---|---|
Subjects
Subject | Biodiversity |
---|---|
Subject | Marine science libraries |
Subject | Text data mining |
Subject | Large language model |
Genre | Text |
Genre | Presentation slides |
Bibliographic information
Access conditions
- Use and reproduction
- User agrees that, where applicable, content will not be used to identify or to otherwise infringe the privacy or confidentiality rights of individuals. Content distributed via the Stanford Digital Repository may be subject to additional license and use restrictions applied by the depositor.
- License
- This work is licensed under a Creative Commons Attribution 4.0 International license (CC BY).
Preferred citation
- Preferred citation
- Whitmire, Amanda (2023). A needle in a haystack: Using a large language model to aid in the identification of species occurrences in unpublished student research reports. [oral presentation]. 49th Annual Conference of the International Association of Aquatic and Marine Science Libraries and Information Centers (IAMSLIC), British Columbia, Canada. Stanford Digital Repository. Available at https://purl.stanford.edu/jv039vn8249. https://doi.org/10.25740/jv039vn8249.
Collection
Stanford Libraries staff presentations, publications, and research
Contact information
- Contact
- thalassa@stanford.edu