Advanced Computing User Day

Claartje Barkhof

Claartje Barkhof is a scientist and integrator in the Advanced Computing Engineering group at TNO. Her work focuses on machine learning research, its integration into computational systems, and its application to societal use cases. She earned a master’s degree in Artificial Intelligence from the University of Amsterdam in 2021, where she also worked as a research assistant at the Institute for Logic, Language, and Computation, focusing on deep generative latent variable modeling.


Session

12-12
12:00
25min
Technology and architecture assessments for scalable and energy-aware training of GPT-NL
Claartje Barkhof, Thomas van Osch

GPT-NL is a publicly funded initiative to build a sovereign, transparent, and ethically driven Dutch Large Language Model (LLM). Its commitment to a transparent and ethically driven development process requires assessing and choosing training frameworks and training architectures that are efficient and energy-aware. Over the last decade, the basic approach to training language models has remained relatively consistent, while the size of the models has grown exponentially. As a result, increasing engineering effort goes into scaling the model and the training process across a large compute pool, and into implementing an architecture that allows close monitoring of such a costly and energy-intensive process. In this session, we will share insights into the training process of the GPT-NL model and the design decisions that help exploit the state-of-the-art NVIDIA H100-enabled nodes in the Snellius supercomputer. We will present intermediate results of our effort to design an architecture that implements a training pipeline while supporting experiment management, traceability, and energy monitoring. We will discuss our choices of software stacks for model building (i.e., native PyTorch versus Hugging Face) and distributed training (i.e., PyTorch’s FSDP versus DeepSpeed’s ZeRO), supported by experimental results, with a focus on optimizing for (energy-)efficient training and effective hardware utilization.

Generative AI and Machine Learning
Quest