Advanced Computing User Day

Generative AI for Research Data Processing: Lessons Learnt From Three Use Cases
2024-12-12, Quest

We demonstrate the feasibility of using generative AI for processing large quantities of heterogeneous, unstructured data. In an exploratory study on the application of this new technology in research data processing, we identified tasks for which rule-based or traditional machine learning approaches were difficult to apply, and then performed these tasks using generative AI.

We used generative AI in three research projects from three different domains, involving the following complex data processing tasks:

1) Information extraction: Extraction of plant species names from historical seedlists (catalogues of seeds) published by botanical gardens.
2) Natural language understanding: Extraction of certain data points (name of drug, name of health indication, relative effectiveness, cost-effectiveness, etc.) from documents published by different Health Technology Assessment organisations in the EU.
3) Text classification: Assignment of industry codes to projects on the crowdfunding website Kickstarter.

We share the lessons we learnt from these use cases: how to determine whether generative AI is an appropriate tool for a given data processing task and, if so, how to maximise the accuracy and consistency of the results it produces.

Modhurita Mitra is a Research Engineer in the Research and Data Management Services department at Utrecht University.

She earned bachelor's and master's degrees in physics from the Indian Institute of Technology, Kharagpur in India, and a PhD in astronomy from the University of Illinois at Urbana-Champaign in the United States. She held postdoctoral positions at Rhodes University in South Africa and the University of Groningen in the Netherlands. She has extensive experience in processing, analysis, and simulation of astronomical data.

Her current work focuses on the application of generative AI in research data processing. She has demonstrated the utility of generative AI as a powerful new general-purpose technology by applying it to data from diverse research domains (botany, pharmaceutical sciences, economic geography, and history) to perform a wide range of complex data processing tasks (information extraction, natural language understanding, text classification, and sentiment analysis).