SURF Netwerk & Cloud event

Enabling Scalable, Efficient, and Collaborative Scientific Workflows with a Modern Data Lake Architecture
30-9-2025, Erik de Vries

As research becomes increasingly data-driven and collaborative, there is a critical need for modern, scalable infrastructure to manage vast and diverse datasets. Traditional data management systems, which often rely on rigid hierarchies and predefined schemas, are proving inadequate in the face of growing data volumes, variety, and velocity. To address these challenges, we are examining the concept of a data lake: an open, flexible, and powerful architecture for storing and analysing research data across disciplines and formats.

A data lake is a centralised repository that stores data in its raw form, accommodating structured, semi-structured, and unstructured formats. Unlike conventional data warehouses, it uses flat object storage combined with rich metadata tagging to enable efficient, scalable data access. This architecture supports a wide range of analytical and machine learning tools without requiring data to be moved or duplicated, thereby increasing cost-efficiency and reducing complexity.
Find out how our proposed architecture paves the way for more reproducible, transparent, and efficient scientific workflows, empowering researchers to derive deeper insights and drive innovation at scale.
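
To make "flat object storage combined with rich metadata tagging" concrete, the sketch below models it in memory. It is an illustration only, not the proposed architecture: the class `TinyDataLake`, its methods `put`, `find` and `get`, the object keys, and the tag names `domain` and `format` are all hypothetical. Objects live in a single flat namespace, and data is located by querying tags instead of walking a directory hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One raw object plus its descriptive tags (hypothetical model)."""
    key: str
    data: bytes
    tags: dict = field(default_factory=dict)


class TinyDataLake:
    """In-memory stand-in for a flat object store with metadata tagging."""

    def __init__(self):
        # Flat namespace: one dict, no folders or nesting.
        self._objects = {}

    def put(self, key, data, **tags):
        # Store the bytes as-is; no schema is imposed on the content.
        self._objects[key] = StoredObject(key, data, dict(tags))

    def find(self, **wanted):
        # Return the keys of all objects whose tags match every requested pair.
        return sorted(
            key
            for key, obj in self._objects.items()
            if all(obj.tags.get(name) == value for name, value in wanted.items())
        )

    def get(self, key):
        # Read the object in place; nothing is copied to another system.
        return self._objects[key].data


lake = TinyDataLake()
lake.put("a1f3", b"t,temp\n0,281.5\n", domain="climate", format="csv")
lake.put("b7c9", b'{"reads": 1200}', domain="genomics", format="json")
lake.put("c2d4", b"\x89PNG...", domain="climate", format="png")

# Structured, semi-structured and binary data sit side by side;
# tags, not paths, decide what a query returns.
climate_keys = lake.find(domain="climate")
```

In a real deployment the dictionary would be replaced by an object storage service and the tag lookup by a metadata catalogue, but the access pattern stays the same: analysis tools query the metadata and then read the raw objects where they are stored.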