Fundamental bottlenecks for AI and HPC

Snellius and other HPC systems are not magic, even if it may sometimes feel that way.
Efficient use of the available hardware is the difference between a model that is merely 'OK' and a model that is state-of-the-art (and I have examples in my pocket to prove it!).
Trusting that whatever you throw at the system will be efficient 'automagically' is the quickest way to burn GPU hours without getting what you really want: breakthrough science!


Is your dataloader asleep at the wheel? Is over-eager logging killing your performance by forcing CPU<->GPU syncs? Does 100% GPU utilization actually mean your GPU is being used effectively? (Hint: it doesn't!)
In this talk we'll go over the fundamental bottlenecks of compute: the parts of any HPC system that make your workflow slower than it needs to be, and what you can do to take it from 'it eventually works' to 'it works remarkably well'.
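To make the dataloader and logging points concrete, here is a minimal PyTorch-style sketch. PyTorch, the toy dataset, and the specific parameters are my illustration, not tied to any particular Snellius workflow:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy dataset standing in for a real one.
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

# Asleep at the wheel: num_workers=0 prepares every batch on the main
# process, so the GPU idles between steps.
# loader = DataLoader(dataset, batch_size=64, num_workers=0)

# Better: worker processes plus pinned memory keep the GPU fed.
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

model = torch.nn.Linear(128, 10).to(device)
loss_fn = torch.nn.CrossEntropyLoss()
losses = []

for x, y in loader:
    x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    loss = loss_fn(model(x), y)

    # Over-eager logging: loss.item() makes the CPU wait for the GPU,
    # turning every single step into a synchronization point.
    # print(f"loss: {loss.item():.4f}")

    # Cheaper: keep the value on the GPU and sync once at the end.
    losses.append(loss.detach())

print(f"mean loss: {torch.stack(losses).mean().item():.4f}")  # one sync total
```

The same pattern applies to any framework: move data preparation off the critical path, and only pull values back to the CPU when you actually need them.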

I have been at SURF for five years as an AI+HPC advisor, helping AI researchers make GPUs go brrr on our National Supercomputer Snellius. Few parts of the AI+HPC intersection remain a mystery to me, and as an aspiring wise old man I love to preach the gospel of efficiency to anyone who will listen.