PIDfest 26

Harnessing PID APIs to discover Canadian datasets (or other subsets of the research ecosystem)

Defining “a Canadian dataset” and identifying entities that satisfy that definition is a surprisingly tricky problem, but the PID ecosystem comes together to provide a solution. Lunaris is a national data discovery service that aims to index metadata describing every Canadian dataset and ensure that those datasets are as easily discoverable as possible. In this session you will learn how Lunaris combines the metadata available from DataCite, ROR, and ORCID to isolate Canadian datasets from a sea of other research objects. We will focus on our mandate of harvesting Canadian datasets, but you will learn how to combine the PID metadata available from these services’ APIs to identify whatever subset of research entities you may care about. You will see a concrete example of how complete PID metadata allows the audience of research outputs to expand beyond what their creators could have initially envisioned.


Lunaris is Canada’s national research data discovery service, run by the Digital Research Alliance of Canada, a non-profit that operates and funds digital research infrastructure in Canada. Lunaris’ goal is to harvest metadata records describing all Canadian research datasets, making them discoverable in one place and driving traffic back to the repositories that hold those datasets. Harvesting all Canadian datasets is an ambitious goal, and our team has encountered many practical challenges as we develop Lunaris.

Some data sources make it easier on us than others. One practice common to the repositories that are easy for Lunaris to harvest is judicious use of PIDs to identify and describe relevant elements of their ecosystems. In this session you will learn how the Lunaris team harvests a new data source, hear about the challenges involved in identifying and harvesting Canadian datasets, and learn how data sources that use PIDs to identify and describe their data eliminate those challenges.

Defining what constitutes a “Canadian dataset” is a key challenge in operating Lunaris as a discovery service that is exclusively concerned with Canadian data. Our operational definition is that any data that was produced by at least one Canadian researcher and/or institution is Canadian. We harvest several repositories that exclusively hold Canadian data according to this definition, including repositories maintained by Canadian institutions and government bodies, and in these cases, there is no issue.
However, it is harder to operationalize this definition of Canadian data when harvesting international repositories. In these cases, we need access to metadata describing the affiliation(s) of dataset creators, and we need to be able to identify whether those affiliations are Canadian. When affiliations are in free text, we can attempt to match institution names against a list of Canadian institutions, but difficulties arise with typographical errors and with ambiguous organization names (e.g. “Queen’s University” or “Western University”). Data sources that use ROR IDs for creator affiliations eliminate this issue by giving each organization a unique identifier. ROR also resolves the issue of determining which organizations are Canadian by providing a “country” field in the associated metadata records.
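
As a rough illustration of the affiliation check described above, here is a minimal Python sketch. It is not Lunaris' actual code. It assumes a ROR record exposes its country either as `country.country_code` or as `locations[].geonames_details.country_code`, and that a creator's affiliations arrive as dictionaries carrying an `affiliationIdentifier` key. Those field layouts, the helper names, and the placeholder identifier `0example1` are all assumptions made for the example.

```python
from typing import Any, Iterable, Mapping


def ror_country_codes(record: Mapping[str, Any]) -> set[str]:
    """Collect country codes from a ROR record in either assumed layout."""
    codes: set[str] = set()
    # Assumed layout A: {"country": {"country_code": "CA"}}
    country = record.get("country") or {}
    if country.get("country_code"):
        codes.add(country["country_code"].upper())
    # Assumed layout B: {"locations": [{"geonames_details": {"country_code": "CA"}}]}
    for location in record.get("locations") or []:
        code = (location.get("geonames_details") or {}).get("country_code")
        if code:
            codes.add(code.upper())
    return codes


def is_canadian(record: Mapping[str, Any]) -> bool:
    """True when the ROR record lists Canada as one of its countries."""
    return "CA" in ror_country_codes(record)


def short_ror_id(identifier: str) -> str:
    """Reduce 'https://ror.org/0example1' or 'ror.org/0example1' to '0example1'."""
    return identifier.rstrip("/").rsplit("/", 1)[-1].lower()


def has_canadian_creator(
    creators: Iterable[Mapping[str, Any]], canadian_ids: set[str]
) -> bool:
    """True when any creator affiliation carries a ROR ID from canadian_ids."""
    for creator in creators:
        for affiliation in creator.get("affiliation") or []:
            if not isinstance(affiliation, Mapping):
                continue  # free-text affiliation: no identifier to compare
            identifier = affiliation.get("affiliationIdentifier")
            if identifier and short_ror_id(identifier) in canadian_ids:
                return True
    return False
```

Comparing identifiers rather than names sidesteps the typo and ambiguity problems entirely: a free-text affiliation is simply skipped here, where a name-matching approach would have to guess.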

That said, in very large data sources like Zenodo, it’s not practical to loop through every single record and check whether any creator has a Canadian affiliation: we need a way to proactively query for Canadian datasets. DataCite’s API gives us just that with its “affiliation-id” filter. We can generate a list of Canadian ROR IDs, then query DataCite’s API for each one.
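
The query-per-ROR approach can be sketched in Python as follows. This is a hedged illustration and not the Lunaris harvester: the `affiliation-id` filter is named in this abstract, while the `resource-type-id` and `page[size]` parameters, the scheme-less form of the ROR identifier, the `data`/`id` response layout, and the helper names are assumptions to check against DataCite's API documentation. The sketch only builds request URLs and merges already-fetched pages, so it makes no network calls.

```python
from typing import Any, Iterable, Mapping
from urllib.parse import urlencode

DATACITE_DOIS = "https://api.datacite.org/dois"


def datacite_query_url(ror_id: str, page_size: int = 100) -> str:
    """Build a DOI search URL for datasets affiliated with one ROR ID."""
    params = {
        # Assumption: the filter takes the ROR ID without its URL scheme.
        "affiliation-id": ror_id.split("://", 1)[-1],
        # Assumption: this parameter restricts results to datasets.
        "resource-type-id": "dataset",
        "page[size]": page_size,
    }
    return f"{DATACITE_DOIS}?{urlencode(params)}"


def merge_by_doi(pages: Iterable[Mapping[str, Any]]) -> list[Mapping[str, Any]]:
    """Combine result pages, keeping the first record seen for each DOI.

    A dataset whose creators belong to two Canadian institutions is returned
    by both institutions' queries, so results are de-duplicated here.
    """
    seen: dict[str, Mapping[str, Any]] = {}
    for page in pages:
        for record in page.get("data") or []:
            doi = str(record.get("id", "")).lower()
            if doi and doi not in seen:
                seen[doi] = record
    return list(seen.values())


# Placeholder ROR IDs, made up for illustration only.
canadian_ror_ids = ["https://ror.org/0example1", "https://ror.org/0example2"]
urls = [datacite_query_url(ror_id) for ror_id in canadian_ror_ids]
```

Each URL in `urls` would then be fetched and paged through with an HTTP client of your choice, and the resulting pages passed to `merge_by_doi`.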

In general, the creators of these datasets didn’t plan for them to be indexed by Lunaris or any other discovery service. If you create research outputs, describing them with rich metadata including PIDs allows those outputs to be discovered and collected by any community that may be interested. If you are interested in collecting research outputs matching any definition, you will learn in this session how the PID ecosystem enables you to do it.

Tristan Kuehn

Tristan Kuehn is Product Lead, Discovery Services at the Digital Research Alliance of Canada. Tristan manages Lunaris, the Alliance's Canadian data discovery service; serves as technical lead for the DataCite Canada Consortium; and serves as staff liaison for the Alliance's Discovery and Metadata Expert Group.