No GPU required: Training and using scalable LLMs on CPUs
2025-12-04, Progress

Transformer-based LLMs are prohibitively expensive to train at scale, requiring massive GPU capacity. Alternative technologies exist that produce functionally equivalent LLMs at a fraction of the training cost, using high-memory CPU nodes instead. I will illustrate this with a memory-based LLM trained on Snellius' hi-mem nodes.


Memory-based language modeling, proposed by Van den Bosch (2006a), is a machine learning approach to next-token prediction based on the k-nearest neighbor (k-NN) classifier (Aha, Kibler, and Albert, 1991; Daelemans and Van den Bosch, 2005). This non-neural approach stores all training data in memory and generalizes from it when classifying new, unseen data, using similarity-based inference. Memory-based language modeling is functionally roughly equivalent to decoder-only Transformers (Vaswani et al., 2017), in the sense that both can run in autoregressive text-generation mode and predict the next token from a certain amount of prior context.
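
To make this concrete, here is a minimal Python sketch of next-token prediction as k-NN classification over fixed-width context windows. It is purely illustrative and not the TiMBL implementation; the context width, the overlap distance, and the toy corpus are assumptions made for the example.

    # Minimal sketch: store every fixed-width context window with its next
    # token, and predict by majority vote over the k nearest neighbors,
    # with distance = number of mismatching context positions (overlap metric).
    from collections import Counter

    CONTEXT = 4  # number of preceding tokens used as features (assumed value)

    def build_memory(tokens, context=CONTEXT):
        """One pass over the training tokens; no iterative optimization."""
        padded = ["<s>"] * context + tokens
        return [(tuple(padded[i - context:i]), padded[i])
                for i in range(context, len(padded))]

    def predict_next(memory, context_tokens, k=3):
        """Plain k-NN with an overlap distance; O(nd) per query."""
        def distance(stored):
            return sum(a != b for a, b in zip(stored, context_tokens))
        neighbors = sorted(memory, key=lambda ex: distance(ex[0]))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    corpus = "the cat sat on the mat and the dog sat on the rug".split()
    memory = build_memory(corpus)
    print(predict_next(memory, ("and", "the", "dog", "sat")))  # prints "on"

The exact match in memory dominates the vote here; with noisier contexts, the k nearest partially matching neighbors decide by majority.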

While training a memory-based language model is generally low-cost, as it involves a single pass over the training data and no convergence-based iterative optimization, a naive implementation would render it useless for inference. The upper-bound complexity of k-nearest neighbor classification is notoriously unfavorable: O(nd), where n is the number of examples in memory and d is the number of features or dimensions (e.g. the context size). However, fast approximations are available: Daelemans et al. (2010) describe a range of approximations that provide fast classification and data compression using prefix tries. Another notable aspect of memory-based language modeling, observed earlier by Van den Bosch (2006b), is that its next-word prediction performance increases log-linearly: with every 10-fold increase in the amount of training data, next-word prediction accuracy increases by a roughly constant amount (a plateau may eventually appear, but we never reached it because of memory limitations).
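
The prefix-trie idea can be sketched as follows in Python. This is a rough illustration of the principle only, not the actual IGTree algorithm in TiMBL (which additionally orders features by information gain and stores default classes at internal nodes): contexts are stored nearest-token-first in a trie, and lookup descends only as far as the context matches, so query cost depends on the context size d rather than on the number of stored examples n.

    # Sketch of prefix-trie lookup for next-token prediction (illustrative).
    from collections import Counter

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.labels = Counter()  # next-token counts seen below this node

    def insert(root, context, next_token):
        """Store one instance; context is ordered nearest-token-first."""
        node = root
        node.labels[next_token] += 1
        for tok in context:
            node = node.children.setdefault(tok, TrieNode())
            node.labels[next_token] += 1

    def predict(root, context):
        """Follow the context as deep as it matches, then take the majority."""
        node = root
        for tok in context:
            if tok not in node.children:
                break
            node = node.children[tok]
        return node.labels.most_common(1)[0][0]

    root = TrieNode()
    tokens = "the cat sat on the mat and the dog sat on the rug".split()
    for i in range(2, len(tokens)):
        insert(root, (tokens[i - 1], tokens[i - 2]), tokens[i])  # 2-token context
    print(predict(root, ("sat", "dog")))  # prints "on"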

The relatively low costs of both training and inference make memory-based language modeling a potentially eco-friendly alternative to the generally costly training of Transformer-based language models (Strubell et al., 2019). All experiments carried out so far with memory-based language models have been based on publicly available software, with TiMBL as the basic classification engine (https://github.com/LanguageMachines/timbl). All scripts required for training and inference are available on GitHub (https://github.com/antalvdb/memlm).
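
For readers who want to experiment, the sketch below shows one plausible way to turn a tokenized corpus into TiMBL-style instances (context features followed by the class label, one instance per line) and a typical command-line call. This is an assumption-laden illustration: the exact instance format, context width, and options used by the memlm scripts may differ, so consult that repository and the TiMBL reference guide.

    # Hypothetical data preparation for TiMBL: each line holds the preceding
    # context tokens as features and the next token as the class label.
    CONTEXT = 4  # assumed context width

    def write_instances(tokens, path, context=CONTEXT):
        padded = ["_"] * context + tokens
        with open(path, "w", encoding="utf-8") as out:
            for i in range(context, len(padded)):
                out.write(" ".join(padded[i - context:i] + [padded[i]]) + "\n")

    write_instances("the cat sat on the mat".split(), "train.txt")

    # TiMBL can then be trained and tested from the command line, for example
    # with the IGTree approximation (-a 1); check the reference guide for the
    # full set of options:
    #   timbl -a 1 -f train.txt -t test.txt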

References

D. W. Aha, D. Kibler, and M. Albert. 1991. Instance-based learning algorithms. Machine Learning, 6:37–66.

W. Daelemans and A. Van den Bosch. 2005. Memory-based language processing. Cambridge University Press, Cambridge, UK.

W. Daelemans, J. Zavrel, K. Van der Sloot, and A. Van den Bosch. 2010. TiMBL: Tilburg Memory Based Learner, version 6.3, reference guide. Technical Report ILK 10-01, ILK Research Group, Tilburg University.
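
E. Strubell, A. Ganesh, and A. McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.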

A. Van den Bosch. 2006a. Scalable classification-based word prediction and confusible correction. Traitement Automatique des Langues, 46(2):39–63.

A. Van den Bosch. 2006b. All-word prediction as the ultimate confusible disambiguation. In Proceedings of the Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, pages 25–32, New York City, New York. Association for Computational Linguistics.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, Red Hook, NY, USA. Curran Associates Inc.

In my research I develop machine learning and language technology. Most of my work involves the intersection of the two fields: computers that learn to understand and generate natural language, nowadays known as Generative AI and Large Language Models. The computational models that this work produces have all kinds of applications in other areas of scholarly research as well as in society and industry. They also link in interesting ways to theories and developments in linguistics, psycholinguistics, neurolinguistics, and sociolinguistics. I love multidisciplinary collaborations to make advances in all these areas.