A needle in a haystack: Using a large language model to aid in the identification of species occurrences in unpublished student research reports
- Our coastal marine station, along with many others, holds a valuable collection of student research reports spanning several decades. These reports contain historical observations that could play a crucial role in identifying changes in biodiversity and revealing the impacts of climate change. However, a major obstacle in utilizing these observations of marine plants and animals is that they are essentially buried within vast amounts of text, making manual analysis impractical. In recent years, a dedicated team of Stanford Libraries staff has been experimenting with various text-mining approaches on our collection of student reports from Hopkins Marine Station. Our objective has been to identify species occurrences (species + place + date) and verify the accuracy of these observations. While we have achieved some small successes along the way, progress has been hindered by limited staff resources to develop a trained language model and a user interface to efficiently process over 700 papers. Nonetheless, our hopes were reignited by recent advancements in pre-trained large language models (LLMs) such as GPT-4 and LLaMA. In this presentation, I will share our journey of exploring the potential of applying an LLM approach to our corpus. I will delve into the development of prompts to extract taxonomic names, places, habitats, and dates from the corpus, and provide an overview of the workflow that spans from compiling collection metadata to the extraction of species occurrences.
|Type of resource
|October 25, 2023
|November 6, 2023
|October 26, 2023
|Marine science libraries
|Text data mining
|Large language model
- Use and reproduction
- User agrees that, where applicable, content will not be used to identify or to otherwise infringe the privacy or confidentiality rights of individuals. Content distributed via the Stanford Digital Repository may be subject to additional license and use restrictions applied by the depositor.
- This work is licensed under a Creative Commons Attribution 4.0 International license (CC BY).
- Preferred citation
- Whitmire, Amanda (2023). A needle in a haystack: Using a large language model to aid in the identification of species occurrences in unpublished student research reports. [oral presentation]. 49th Annual Conference of the International Association of Aquatic and Marine Science Libraries and Information Centers (IAMSLIC), British Columbia, Canada. Stanford Digital Repository. Available at https://purl.stanford.edu/jv039vn8249. https://doi.org/10.25740/jv039vn8249.
Stanford Libraries staff presentations, publications, and research