Developing Robust Search with Open-Source LLMs
2025-12-04, Progress

While open-source search models have greatly improved with transformer-based architectures, they still face challenges outside their training domain, such as when applied to multimodal or non-English data. In this talk, we will describe some of our ongoing work developing new open-source models to address these challenges:

  • Multilingual retrieval. We train an effective multilingual sparse retrieval model that achieves state-of-the-art performance on standard multilingual benchmarks while continuing to perform well on English.
  • Multimodal retrieval. We improve visual document retrieval with an approach that leverages existing vision-language models.
  • Complex retrieval. We develop query expansion methods for complex information needs that standard approaches handle poorly.
  • Synthetic data generation. We explore synthetic data generation to enable training and evaluation in broader scenarios such as retrieval-augmented generation (RAG).
  • Efficient retrieval models. Given the increased computational costs of using LLMs for retrieval, we explore several strategies for improving their efficiency, including an effective pruning approach that yields smaller models with comparable performance (a simple variant is sketched after this list), alongside related engineering work.
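
To give an intuition for the pruning direction, the minimal sketch below shows one simple, generic strategy: keeping only every k-th encoder layer of a Transformer to obtain a smaller model that can then be re-fine-tuned for retrieval. This is a hypothetical illustration only; the model name, the keep_every parameter, and the layer-dropping heuristic are assumptions, not the approach presented in the talk.

```python
# Hypothetical sketch: shrink a Transformer encoder by keeping every k-th layer.
# This is NOT the talk's pruning method; it only illustrates the general idea.
import torch.nn as nn
from transformers import AutoModel

def drop_layers(model, keep_every: int = 2):
    """Keep every `keep_every`-th encoder layer; the result needs re-fine-tuning."""
    layers = model.encoder.layer                      # BERT-style ModuleList of layers
    kept = nn.ModuleList(layers[i] for i in range(0, len(layers), keep_every))
    model.encoder.layer = kept
    model.config.num_hidden_layers = len(kept)        # keep the config consistent
    return model

model = AutoModel.from_pretrained("bert-base-multilingual-cased")  # 12 layers
small = drop_layers(model, keep_every=2)                           # -> 6 layers
```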

Our talk will describe our research on robust search with open-source LLMs and briefly present our engineering work developing a Triton kernel to speed up training and inference with learned sparse retrieval models; both efforts leverage the computational power of the LUMI supercomputer.
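
To make the learned sparse retrieval setting concrete, the minimal sketch below shows the scoring step such models rely on: queries and documents are encoded as high-dimensional, mostly-zero term-weight vectors over the vocabulary, and relevance is their dot product. The function name and toy data are illustrative assumptions; this plain PyTorch version only shows the kind of operation a custom kernel would accelerate, not the Triton kernel itself.

```python
# Illustrative sketch of learned sparse retrieval scoring (not the talk's kernel).
import torch

def sparse_scores(query_weights: torch.Tensor, doc_weights: torch.Tensor) -> torch.Tensor:
    """Score documents against a query via term-weight dot products.

    query_weights: (vocab_size,) non-negative term weights for the query
    doc_weights:   (num_docs, vocab_size) non-negative term weights per document
    Returns:       (num_docs,) relevance scores
    """
    return doc_weights @ query_weights

# Toy usage with a tiny vocabulary
vocab_size, num_docs = 8, 3
q = torch.zeros(vocab_size)
q[[1, 5]] = torch.tensor([1.2, 0.7])          # query activates two "terms"
d = torch.zeros(num_docs, vocab_size)
d[0, [1, 2]] = torch.tensor([0.9, 0.4])       # doc 0 shares term 1 with the query
d[2, [5]] = torch.tensor([1.1])               # doc 2 shares term 5
print(sparse_scores(q, d))                    # doc 1 scores 0; docs 0 and 2 score > 0
```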

I am a PhD student at IRLab, Amsterdam. My research interests include representation learning and retrieval-augmented generation.