GeoE3 Project Task 2.4 Demo Environment

Introduction

The following environment outlines the requirements and process steps needed to improve search engine findability for datasets housed in a national geoportal catalogue. In general, a search engine can find information more easily when metadata about that information is indexable by the search engine itself. Making a webpage indexable requires that metadata is embedded in the HTML header of the page using the schema.org vocabulary. For datasets specifically, the metadata that already exists in catalogues can therefore be reused in the headers of webpages to improve the findability of those datasets in a search engine. This requires the implementation of several process steps, as demonstrated here.

As a demonstration environment, the process described below was tested and implemented within the context of Kadaster, the Dutch National Mapping Agency. Although the results are transferable to other environments, the process would likely require some adaptation when implemented in a different context.

A Note on Tooling

The demonstration provided here was implemented exclusively using the features and functionality of TriplyDB. This environment can be used for testing purposes; to do so, please contact the content provider for your module. The demonstration requires that a SPARQL service is started and that queries are run against it. For a complete overview of how to perform these steps using the TriplyDB interface, please refer to the TriplyDB documentation.

Requirements and Implementation Steps

In order to prepare and set up a process for search engine optimisation for dataset findability, the following steps are required:

DCAT-SDO Mapping Definition. If the goal is to optimise dataset findability in the Google search engine, the use of schema.org (SDO) for the definition of the metadata structure is a requirement. Because much dataset metadata exists as DCAT-AP, a mapping from DCAT to schema.org should be defined. This mapping forms the basis of a SPARQL construct query which is used to transform DCAT metadata into SDO metadata; a simplified sketch of such a query is given below. The full overview of the schema.org vocabulary can be found here.
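As an illustration only, and not the production query used in the Dutch context, a simplified construct query implementing such a mapping could look as follows; the prefixes and the selection of properties are assumptions made for the sake of the example:

    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX sdo:  <https://schema.org/>

    # Map a DCAT-AP dataset description to a schema.org (SDO) counterpart.
    CONSTRUCT {
      ?dataset a sdo:Dataset ;
        sdo:name        ?title ;
        sdo:description ?description ;
        sdo:keywords    ?keyword ;
        sdo:url         ?landingPage ;
        sdo:license     ?license .
    }
    WHERE {
      ?dataset a dcat:Dataset ;
        dct:title       ?title ;
        dct:description ?description .
      OPTIONAL { ?dataset dcat:keyword     ?keyword . }
      OPTIONAL { ?dataset dcat:landingPage ?landingPage . }
      OPTIONAL { ?dataset dct:license      ?license . }
    }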

Implementation of ETL pipeline. Because much of the metadata pertaining to datasets is published in geoportals or national data portals, improving the findability of datasets on the web requires that this metadata is transformed and embedded into webpages in a way that makes those webpages indexable by the search engine. The ETL pipeline loads all the raw metadata from the national geoportal into a triplestore, transforms it to the schema.org profile, and then embeds the correct metadata into the webpage of each dataset. This process is illustrated in the figure below.

Figure 1. ETL Process for Including SDO Metadata in Webpages of Datasets

Concretely, this process involves the following steps:

  • In the Dutch context, metadata was taken from the national geoportal and embedded into the national catalogue for governmental open data in order to improve findability in the latter. Where a similar two-portal setup applies in a different context, the HTML pages of the national geoportal should be scraped and matched to identifiers in the second environment; this can be done using GUIDs.
  • The raw metadata from the national portal should be imported into a triplestore from an API (if necessary, based on the GUIDs); a minimal loading example is sketched below. If making use of a TriplyDB instance, please refer to the TriplyDB documentation on how to do this.

The raw metadata which is being used in the ETL implemented in the Dutch context can be seen here.
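If the portal already exposes its DCAT-AP records as downloadable RDF, one minimal way to pull them into a triplestore is a SPARQL 1.1 Update LOAD statement; the URL and graph name below are placeholders, and in practice the harvesting may instead go through the portal's API or the TriplyDB upload facilities:

    # Load a DCAT-AP export (placeholder URL) into a named graph for the raw metadata.
    LOAD <https://example.org/geoportal/dcat-export.rdf>
      INTO GRAPH <https://example.org/graphs/raw-dcat>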

  • Raw metadata from the geoportal should be transformed to the schema.org profile, in accordance with Google's compliance requirements, using SPARQL construct queries. These construct queries are based on the mapping defined in the previous implementation step.

The SPARQL queries used in the Dutch context for mapping between DCAT and schema.org can be seen here. One query transforms dataset metadata to the schema.org profile and the other transforms service metadata. Note: running these SPARQL queries requires that a SPARQL service is started on the dataset containing the raw DCAT metadata from the geoportal.
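The exact shape of these queries depends on the mapping chosen; purely as an illustration of the distribution/service side, and not the production query, a fragment exposing DCAT distributions as schema.org DataDownload objects (the structure expected by Google's dataset guidelines) could look like this:

    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX sdo:  <https://schema.org/>

    # Illustrative fragment: expose DCAT distributions as schema.org DataDownload objects.
    CONSTRUCT {
      ?dataset sdo:distribution ?distribution .
      ?distribution a sdo:DataDownload ;
        sdo:contentUrl     ?accessURL ;
        sdo:encodingFormat ?format .
    }
    WHERE {
      ?dataset a dcat:Dataset ;
        dcat:distribution ?distribution .
      ?distribution dcat:accessURL ?accessURL .
      OPTIONAL { ?distribution dct:format ?format . }
    }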

  • Transformations should ideally, though not necessarily, be validated using SHACL to ensure compliance with the metadata profile mapping defined above; a minimal example shape is sketched below. This requires a SHACL validation engine, which is not included in the TriplyDB tool set.

The transformed metadata can be found here.
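As an indication of what such a validation could check, a minimal SHACL shape, written here in an example namespace and checking only that every schema.org dataset carries a name and a description, might look as follows; the actual requirements follow from the mapping and from Google's guidelines:

    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix sdo: <https://schema.org/> .
    @prefix ex:  <https://example.org/shapes/> .

    # Minimal example shape: every sdo:Dataset must have a name and a description.
    ex:DatasetShape
      a sh:NodeShape ;
      sh:targetClass sdo:Dataset ;
      sh:property [ sh:path sdo:name ;        sh:minCount 1 ] ;
      sh:property [ sh:path sdo:description ; sh:minCount 1 ] .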

  • Transformed metadata should be (re)serialised as JSON-LD files. If using TriplyDB as your triplestore environment, these should be uploaded as assets.

The assets serialised as JSON-LD can be found here.
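For illustration, a single re-serialised record, with entirely fictitious values, could look roughly like this:

    {
      "@context": "https://schema.org/",
      "@type": "Dataset",
      "name": "Example dataset title",
      "description": "Example description of the dataset as harvested from the geoportal.",
      "url": "https://example.org/dataset/12345",
      "license": "http://creativecommons.org/publicdomain/zero/1.0/",
      "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://example.org/download/12345.csv",
        "encodingFormat": "text/csv"
      }
    }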

Once the relevant transformations have been applied, a REST-over-SPARQL API can be used to import the transformed metadata into the HTML of each associated dataset page in the required environment. In the Dutch context, this ETL is currently performed automatically every 24 hours by an open-source script, which ensures that any update to the national geoportal metadata records is reflected in the second environment as close to real time as possible.
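The end result on each dataset page is a JSON-LD block in the HTML header, which is what the search engine actually indexes. A minimal sketch of such an embedding, with placeholder values, is:

    <head>
      <title>Example dataset page</title>
      <!-- schema.org metadata for search engine indexing -->
      <script type="application/ld+json">
        {
          "@context": "https://schema.org/",
          "@type": "Dataset",
          "name": "Example dataset title",
          "description": "Example description of the dataset."
        }
      </script>
    </head>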

Rich Results Test

Once the transformed metadata records have been appropriately integrated into the HTML page of each dataset, a validation step is required to ensure the quality of this metadata and the ability of search engines to index these pages to the fullest extent. In the Dutch context, the metadata included in the HTML pages was validated using Google's Rich Results Test and found to be, in principle, indexable by Google. As such, this process achieved the desired quality in its results as defined by the quality criteria and tolerances in the product description. Ongoing management of the quality of the metadata embedded in HTML pages for indexing can be done using Google's Search Console. This environment allows users to identify dataset pages where poor metadata quality prevents indexing and highlights the appropriate fixes based on schema.org specifications. This tool is openly available to all.

Google Dataset Search

A further validation step is to check that the datasets in question are actually being indexed by Google in practice. This can be done by querying Google's Dataset Search and assessing the results for the presence of the dataset in question.