Dylan Ju

I am a PhD student at IRLab, Amsterdam. My research interests include representation learning and retrieval-augmented generation.


Session

12-04
15:25
25min
Developing Robust Search with Open-Source LLMs
Dylan Ju, Yibin Lei, Thong Nguyen

While open-source search models have greatly improved with transformer-based architectures, they still face challenges outside their training domain, such as when applied to multimodal or non-English text data. In this talk, we will describe some of our ongoing work on developing new open-source models to address these challenges:

  • Multilingual retrieval. We train an effective multilingual sparse retrieval model achieving state-of-the-art performance on standard multilingual benchmarks while continuing to perform well in English.
  • Multimodal retrieval. We improve multimodal retrieval for the visual document retrieval task with an approach leveraging existing vision-language models.
  • Complex retrieval. We develop query expansion for complex information needs that cannot be handled well with standard methods.
  • Synthetic data generation. We explore synthetic data generation to enable training and evaluation in broader scenarios such as retrieval-augmented generation (RAG).
  • Efficient retrieval models. Given the increased computational costs of using LLMs for retrieval, we explore several strategies for improving their efficiency, including engineering work and an effective pruning approach that yields smaller models with comparable performance.
Generative AI and Machine Learning
Progress