Webinar STEP 4 FAIRification on DATA / METADATA

EOSC-Nordic

20.10.2021

On 7 October 2021, data repository representatives from across the Nordics and Baltics participated in a FAIRification webinar organised by the EOSC-Nordic project. The webinar was the fourth step in a series and focused on domain-specific metadata. The idea of concentrating on domain-specific metadata as a follow-up topic originates from the exhaustive FAIR maturity evaluations made by the project team in 2020 and 2021. One of the conclusions of these evaluations was that many data repositories struggled to enrich the data with the right metadata elements.

It is crucial that the metadata is machine-actionable so that a machine agent can find, interpret, and process the data based upon the metadata found, for instance, on the landing page of the repository. In other words, if we want to make research data reusable, the data needs to be enriched with metadata that can be found, interpreted, and processed by machine agents and not only by the human eye. The FAIR Principles set out the guideline for FAIRness of data by indicating the relevance and importance of enriching datasets with clear machine-actionable metadata. For more information on all 15 FAIR Principles, visit the GO FAIR web pages.
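To make "metadata found on the landing page" concrete, here is a minimal sketch of how a machine agent might pick up embedded JSON-LD from a landing page, using only the Python standard library. The landing page, DOI, and title are made up for illustration, and real harvesters (including the assessment tooling used in the project) do considerably more than this.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed metadata cannot be used by a machine agent


def extract_jsonld(html):
    """Return every JSON-LD document embedded in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.documents


# Hypothetical landing page; the identifier and title are invented.
LANDING_PAGE = """
<html><head>
<title>Example dataset</title>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example dataset",
 "identifier": "https://doi.org/10.1234/example",
 "license": "https://creativecommons.org/licenses/by/4.0/"}
</script>
</head><body><h1>Example dataset</h1></body></html>
"""

metadata = extract_jsonld(LANDING_PAGE)
print(metadata)
```

A page that only shows the same information as human-readable text would return an empty list here, which is the gap between human-readable and machine-actionable metadata.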

The FAIRification Webinar Step 4 gave an engaging overview, showing several examples of how metadata experts, working with their specific communities, have managed to derive the required level of metadata by using input from the domain community. The challenge is defining, designing, reporting, and publishing the domain-specific data elements that a community sees as required for interoperability and data reuse.

The webinar demonstrated that using community standards, templates, vocabularies, and ontologies is essential for defining a required reference metadata set.
The webinar Step 4 also demonstrated that organizing a Metadata for Machines (M4M) workshop could be an excellent tool to assist the community in taking the right steps in a structured way. For more information on these M4M workshops, visit the GO FAIR web pages.

The current state of play regarding FAIR maturity levels in Nordic and Baltic data repositories set the scene for the discussion. We were introduced to the very meaning of rich metadata, the concepts of Metadata Templates and Metadata for Machines (M4M) workshops, the importance of controlled vocabularies, and the practicalities involved in implementing metadata schemas in and by a repository provider.

Summary of the webinar Step 4

Bert Meerman from the GO FAIR Foundation hosted the webinar. Bert opened by highlighting the 15 FAIR Principles and addressing a few misconceptions around FAIR. The most important ones were:

Under the FAIR Principles, a machine agent or algorithm should be capable of finding, accessing, interpreting, and reusing the data and metadata of a repository. FAIR is meant for both human and machine interactions and is therefore all about automated findability and machine-actionability of data and metadata.
Note that FAIR data is not the same as “Open and Free data.” The term “Accessible” in FAIR needs to be interpreted as “Accessible under pre-defined conditions.”
A researcher making data FAIR does NOT mean that the researcher has to “give away” their data for free.

Step by step approach

To increase FAIR uptake among repositories in the Nordic and Baltic countries, the project team has followed a step-by-step process and has already hosted a series of events to further this goal.

1. APR 2020 – First assessment hackathon – Initial exercise
2. NOV 2020 – Webinar Step 1 – Focus on PID (Global, Unique, Persistent, Resolvable)
3. FEB 2021 – Webinar Step 2 – Focus on the split between Data and Metadata.
4. APR 2021 – Webinar Step 3 – Focus on Generic Metadata.
5. OCT 2021 – Webinar Step 4 – Focus on Domain-Specific Metadata.

Summary of the FAIR uptake in the EOSC-Nordic project

Andreas Jaunsen, the FAIR data work package leader from NordForsk, presented an overview of the project and its progress on the FAIR uptake.

Andreas presented the process of evaluating the FAIR uptake based upon harvesting the landing page of repositories.

He also explained the limitations of repositories that concentrate on publishing their datasets mainly for human consumption and consequently do not give enough attention to machine-actionability of the (meta)data. A dataset needs to be FAIR for humans as well as for machines. The concept of the “FAIR Digital Object” plays an important role here by permanently and intelligently linking the metadata to the related datasets and vice versa.

Andreas gave context to the topic by describing the FAIR maturity evaluation as a semi-automated assessment process, and he presented the most relevant findings. The investigated sample consisted of around 100 data repositories, and ten (10) datasets were manually selected from each repository. Experiments show that a sample size of 10 seems to be a good estimator for the entire population of datasets within the repository.
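The idea of using ten datasets as an estimator for a whole repository can be sketched as follows. Note the assumptions: the project selected its datasets manually, whereas this sketch draws a random sample, and the scores are synthetic numbers generated purely for illustration, not project results. The function names are hypothetical.

```python
import random
from statistics import mean


def sample_datasets(dataset_ids, k=10, seed=None):
    """Return up to k dataset identifiers drawn without replacement."""
    ids = list(dataset_ids)
    if len(ids) <= k:
        return ids
    return random.Random(seed).sample(ids, k)


def estimate_repository_score(scores, sample_ids):
    """Mean score of the sampled datasets, used as the repository estimate."""
    return mean(scores[i] for i in sample_ids)


# Synthetic scores (percent) for a hypothetical repository with 200 datasets.
rng = random.Random(42)
scores = {f"dataset-{n}": rng.uniform(20, 60) for n in range(200)}

sample = sample_datasets(scores, k=10, seed=1)
estimate = estimate_repository_score(scores, sample)
full_mean = mean(scores.values())
print(f"sample estimate: {estimate:.1f}, full mean: {full_mean:.1f}")
```

How close the sample estimate comes to the full mean depends on how uniform the datasets in a repository are; repositories on a single platform tend to expose metadata in the same way for every dataset, which is one plausible reason a small sample can work.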

The tool used for the automated FAIR Data Assessment is F-UJI, the open-source software developed by staff from Pangaea as part of the FAIRsFAIR project. F-UJI is capable of testing 17 aspects of the FAIR Principles and is combined with Google Scripts to automate the evaluation of 1000+ datasets.

Andreas presented the results of the FAIR assessments, and unsurprisingly, most of the evaluated repositories did not come out very FAIR regarding machine-actionable metadata. Compared to the results reported in earlier webinars, only marginal improvement of the uptake was recorded. About 24% of the repositories could not be evaluated. The overall FAIRness score of the majority of repositories showed a slight increase but remains low.

While we saw quantitative improvement in “descriptive core metadata elements” and “automatically retrieved metadata” over the last six months, it is risky to draw firm conclusions. This small change may include false positives and could be attributed to updates of the evaluation software.

In the results, certified repositories and repositories running on established platforms (Dataverse, Figshare, etc.) scored noticeably higher.

The project will continue to organize activities, with the primary ambition to help repositories achieve higher levels of FAIRness. The project will continue to perform regular FAIR maturity evaluations throughout the project lifespan to monitor increased FAIRness levels.

The subsequent four presentations were given by experts who described the process within their specific community and listed a large number of valuable recommendations:

1. Metadata for the COVID-19 community, at the request of Health-RI and research funder ZonMw, presented by Barbara Magagna, Data Architect and Semantic Expert from the Environment Agency in Austria.

Main takeaways of Barbara’s presentation:

Change/improve the traditional research cycle, using the FAIR Principles.
Use Metadata for Machines (M4M) workshops to assist the community.
Involve data stewards at the start of the FAIRification process.
Add FAIR metadata expertise to the domain expertise.
Use technologies such as Linked Data, JSON-LD, and RDF to represent the metadata.
Use controlled vocabularies (W3C specs) to drive interoperability.
Use existing structures like the CEDAR Workbench and BioPortal to increase the speed of the process.
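As an illustration of the JSON-LD and controlled-vocabulary points in the takeaways above, the sketch below builds a small schema.org `Dataset` record in which the keyword is a term with its own URI rather than a free-text string. The function name, vocabulary, URIs, and DOI are all hypothetical placeholders, not anything shown in the presentation.

```python
import json


def build_dataset_record(title, identifier, term_label, term_uri, vocabulary_uri):
    """Build a schema.org Dataset description as a JSON-LD dictionary.

    The keyword is expressed as a DefinedTerm that points at a controlled
    vocabulary, so a machine agent can resolve its meaning.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": title,
        "identifier": identifier,
        "keywords": [
            {
                "@type": "DefinedTerm",
                "@id": term_uri,
                "name": term_label,
                "inDefinedTermSet": vocabulary_uri,
            }
        ],
    }


# All values below are invented placeholders.
record = build_dataset_record(
    title="Example patient cohort summary",
    identifier="https://doi.org/10.1234/example",
    term_label="COVID-19",
    term_uri="https://vocab.example.org/terms/covid-19",
    vocabulary_uri="https://vocab.example.org/terms",
)

serialised = json.dumps(record, indent=2)
print(serialised)
```

Two repositories that tag a dataset with the same term URI can be matched automatically, which is not the case when each uses its own spelling of a free-text keyword.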

2. Metadata for the SidO community, covering asset and planning data: the exchange of data that describes assets in the ground, such as cables, pipes, and wires. This project is defined by the Dutch Ministry of Infrastructure and Water Management and was presented by Annick van Arkel, Director of the Purple Polar Bears, software makers in the Netherlands.

Main takeaways of Annick’s presentation:

Data of exchange partners remain at the source
No unnecessary centralization of data.
Concentrate on Machine-Actionability of the data
Define the data to be shared with the community.
Map local data of an exchange partner to a community agreed format by using a “top layer approach” comparable to a FAIR Data Point.
Define a Data Plan Template (JSON-LD)
Define a Metadata Template (RDF)
Have exchange partners generate a Metadata File (RDF)
Evaluate the Metadata File against a) FAIR Principles, b) Industry standards, and c) Agreed Metadata elements defined by the Project Team.
Approve Metadata File when positively evaluated.
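The evaluate-then-approve step at the end of this list can be sketched as a simple completeness check. This covers only check c), the agreed metadata elements; checks against the FAIR Principles and industry standards are out of scope here. The element names and the function are hypothetical and are not taken from the SidO project.

```python
# Hypothetical set of elements agreed by a project team.
AGREED_ELEMENTS = ("identifier", "title", "license", "assetType", "publisher")


def evaluate_metadata(record, agreed_elements=AGREED_ELEMENTS):
    """Approve a metadata record only if every agreed element has a value."""
    missing = [
        element
        for element in agreed_elements
        if not str(record.get(element, "")).strip()
    ]
    return {"approved": not missing, "missing": missing}


# Invented example records from two exchange partners.
complete = {
    "identifier": "https://data.example.org/assets/123",
    "title": "Cable route, section A",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "assetType": "cable",
    "publisher": "Example Network Operator",
}
incomplete = {"identifier": "https://data.example.org/assets/456", "title": " "}

print(evaluate_metadata(complete))
print(evaluate_metadata(incomplete))
```

Returning the list of missing elements, rather than a bare yes/no, gives the exchange partner something concrete to fix before resubmitting the file.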

3. Metadata for the geoscience / environmental science community and Pangaea, presented by Robert Huber from the University of Bremen in Germany.

Main takeaways of Robert’s slides:

Pangaea is a very large data collector and data publisher for a large number of essential datasets in the geoscience / environmental community.
Strong focus on ISO certifications to offer and guarantee the provision of very complex user applications.
Darwin Core, Darwin Core Archive, Dublin Core, and Schema.org are building blocks for interoperability and making metadata visible.
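To show how such standards act as building blocks, here is a minimal sketch of a crosswalk from simple Dublin Core elements to schema.org `Dataset` properties. The mapping table is an illustrative assumption, not an official crosswalk and not Pangaea's implementation; in particular, mapping `date` to `datePublished` and `rights` to `license` are approximations. The example record is invented.

```python
# Illustrative mapping only; not an official Dublin Core to schema.org crosswalk.
DC_TO_SCHEMA_ORG = {
    "title": "name",
    "creator": "creator",
    "description": "description",
    "identifier": "identifier",
    "publisher": "publisher",
    "subject": "keywords",
    "date": "datePublished",  # assumption: the date is the publication date
    "rights": "license",      # approximation: rights statements are broader
}


def dublin_core_to_schema_org(dc_record):
    """Translate a flat Dublin Core record into a schema.org Dataset dict.

    Returns the translated record and any elements that had no mapping,
    so that nothing is dropped silently.
    """
    translated = {"@context": "https://schema.org", "@type": "Dataset"}
    unmapped = {}
    for element, value in dc_record.items():
        target = DC_TO_SCHEMA_ORG.get(element)
        if target is None:
            unmapped[element] = value
        else:
            translated[target] = value
    return translated, unmapped


# Invented example record.
dc_record = {
    "title": "Sediment core measurements",
    "creator": "Example, A.",
    "date": "2021-10-07",
    "coverage": "North Atlantic",
}

translated, unmapped = dublin_core_to_schema_org(dc_record)
print(translated)
print(unmapped)
```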

4. Metadata for the wind energy community, presented by Anna Maria Sempreviva from the Technical University of Denmark.

Main takeaways of Anna Maria’s presentation:

Focus on the creation of Metadata and Taxonomy.
Move from open data to FAIR data.
Move from Available data to Findable data
Make data as open as possible, as closed as necessary.
Connect the relevant data with new ideas.
Willingness to expose/share data is crucial.
Create a searchable data catalog for distributed data (controlled by data owners).
Assign taxonomies to metadata (use controlled vocabularies)
Design a data portal as a Virtual Library with a metadata catalog.
Use a Metadata for Machines (M4M) workshop approach to speed up the process.