Note: this piece is adapted from an article originally published on the Frictionless Data website
Context
One of the main goals of the
Frictionless Data project is to help improve data quality by providing easy-to-integrate libraries and services for data validation. We have integrated data validation seamlessly with different backends like GitHub and Amazon S3 via the online service
goodtables.io, but we also wanted to explore closer integrations with other platforms.
An obvious choice for this is Open Data portals. They are still one of the main forms of dissemination of Open Data, especially for governments and other organizations. They provide a single entry point to data relating to a particular region or thematic area, and they give users tools to discover and access different datasets. On the backend, publishers also have tools available for the validation and publication of datasets.
Data quality varies widely across different portals, reflecting the publication processes and requirements of the hosting organizations. In general, it is difficult for users to assess the quality of the data, and there is a lack of descriptors for the actual data fields. At the publisher level, while strong emphasis has been placed on metadata standards and interoperability, publishers don’t generally have the same help or guidance when dealing with data quality or description.
We believe that data quality in Open Data portals can have a central place on both of these fronts, the user-centric and the publisher-centric. With that in mind we created
ckanext-validation, a CKAN extension that provides a low-level API and readily available features for data validation and reporting, and that can be added to any CKAN instance. This is powered by
goodtables, a library developed by Open Knowledge International to support the validation of tabular datasets.
What does ckanext-validation do?
The extension allows users to perform data validation against any tabular resource, such as CSV or Excel files. This generates a report that is stored against the resource, describing the issues found with the data, both at the structural level (missing headers, blank rows, etc.) and at the data schema level (wrong data types, values out of range, etc.).
data validation on CKAN made possible by ckanext-validation extension
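To give an idea of what such a report contains, here is a minimal sketch using the goodtables Python library directly, which is what powers the extension under the hood. The file name is just a placeholder, and the exact report keys can vary slightly between goodtables versions:

```python
from goodtables import validate

# Validate a local tabular file (placeholder path); Excel files and
# remote URLs work the same way.
report = validate('data.csv')

print(report['valid'])  # True if no issues were found
for table in report['tables']:
    for error in table['errors']:
        # Each error carries a code (e.g. 'blank-row'), an optional
        # row position and a human-readable message.
        print(error['code'], error.get('row-number'), error['message'])
```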
These reports give both users and publishers a good overview of the quality of the data, so publishers can improve it by addressing the issues found. The reports can be easily accessed via badges that provide a quick visual indication of the quality of each data file.
badges indicating quality of data files on CKAN
There are two default modes for performing the data validation when creating or updating resources: it can run automatically in the background (asynchronously) or as part of the dataset creation process in the user interface. In the latter case the validation is performed immediately after uploading or linking to a new tabular file, giving publishers quick feedback.
data validation on upload or linking to a new tabular file on CKAN
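Besides these automatic modes, validation can also be triggered and inspected on demand through CKAN's action API. The sketch below uses the ckanapi client against a hypothetical instance; the resource_validation_run and resource_validation_show action names come from the extension, but check its documentation for the exact parameters and returned fields:

```python
from ckanapi import RemoteCKAN

# Hypothetical CKAN instance and API key.
ckan = RemoteCKAN('https://demo-ckan-instance.org', apikey='my-api-key')

# Queue a validation job for a resource (assumed action name from the extension).
ckan.action.resource_validation_run(resource_id='my-resource-id')

# Later, fetch the stored validation result for that resource.
result = ckan.action.resource_validation_show(resource_id='my-resource-id')
print(result['status'])  # e.g. 'success' or 'failure'
print(result['report'])  # the goodtables report described above
```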
The extension adds functionality to provide a
schema for the data that describes the expected fields and types as well as other constraints, allowing validation to be performed against the actual contents of the data. Additionally, the schema is stored with the resource metadata, so it can be displayed in the UI or accessed via the API.
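Schemas follow the Frictionless Table Schema format: a list of fields with their expected types and optional constraints. As a rough sketch (the fields and constraints below are invented for illustration, and the exact validate options should be checked against the goodtables documentation), such a schema lets the contents of every row be checked, not just the structure of the file:

```python
from goodtables import validate

# Illustrative Table Schema: expected fields, their types and constraints.
schema = {
    'fields': [
        {'name': 'id', 'type': 'integer', 'constraints': {'required': True}},
        {'name': 'name', 'type': 'string'},
        {'name': 'amount', 'type': 'number',
         'constraints': {'minimum': 0, 'maximum': 100000}},
        {'name': 'date', 'type': 'date'},
    ]
}

# Validate both the structure and the contents of the file against the schema.
report = validate('data.csv', schema=schema)
print(report['valid'])
```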
The extension also provides some utility commands for CKAN maintainers, including the generation of
reports showing the number of valid and invalid tabular files, a breakdown of the error types, and links to the individual resources. This gives maintainers a snapshot of the overall quality of the data hosted in their CKAN instance at any given moment. You can see example reports in this
GitHub repository.
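To illustrate the kind of information these reports aggregate, here is a hedged sketch that tallies validation results across a whole instance using ckanapi. The site URL is a placeholder, resource_validation_show is the same assumed action as above, and the extension's built-in report commands do this more efficiently:

```python
from collections import Counter
from ckanapi import RemoteCKAN, NotFound

ckan = RemoteCKAN('https://demo-ckan-instance.org')

status_counts = Counter()
error_types = Counter()

for name in ckan.action.package_list():
    dataset = ckan.action.package_show(id=name)
    for resource in dataset['resources']:
        try:
            validation = ckan.action.resource_validation_show(
                resource_id=resource['id'])
        except NotFound:
            continue  # non-tabular resource, or not validated yet
        status_counts[validation['status']] += 1
        for table in (validation.get('report') or {}).get('tables', []):
            for error in table.get('errors', []):
                error_types[error['code']] += 1

print(status_counts)               # e.g. Counter({'success': 120, 'failure': 14})
print(error_types.most_common(5))  # most frequent error codes
```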
Testing it in the wild
To field-test our implementation we chose the
Western Pennsylvania Regional Data Center (WPRDC), managed by the
University of Pittsburgh Center for Urban and Social Research. The Regional Data Center made a good pilot because the project team takes an agile approach to managing their own CKAN instance, with support from OpenGov, members of the CKAN association. As the open data repository is used by a diverse array of data publishers (including project partners Allegheny County and the City of Pittsburgh), the Regional Data Center provides a good test case for the implementation across a variety of data types and publishing processes. WPRDC is a great example of a well-managed Open Data portal, where datasets are actively maintained and the portal itself is just one component of a wider Open Data strategy. It also offers a good variety of publishers, including public sector agencies, academic institutions, and nonprofit organizations.
We harvested all datasets from the WPRDC portal and imported them into
a demo site that mirrors the datasets, organizations and groups hosted there (at the time we did the import). There we ran the validation process against all datasets and generated reports to analyse their issues. All tabular resources have a validation report attached, which can be accessed by clicking on the data valid / invalid badges.
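For reference, mirroring datasets between two CKAN instances can be scripted with a handful of action API calls. This is only a simplified sketch of the approach (URLs and keys are placeholders, and it ignores organizations, groups and many dataset fields), not the exact import script we used:

```python
from ckanapi import RemoteCKAN

# Source portal and target demo site; URLs and key are illustrative.
source = RemoteCKAN('https://data.wprdc.org')
target = RemoteCKAN('https://demo-ckan-instance.org', apikey='target-api-key')

for name in source.action.package_list():
    dataset = source.action.package_show(id=name)
    # Re-create the dataset on the target with a minimal subset of fields;
    # a real mirror would also copy organizations, groups, tags and extras.
    target.action.package_create(
        name=dataset['name'],
        title=dataset['title'],
        notes=dataset.get('notes', ''),
        resources=[
            {'url': r['url'],
             'name': r.get('name', ''),
             'format': r.get('format', '')}
            for r in dataset.get('resources', [])
        ],
    )
```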
You can learn more about the findings (and the caveats) on the
WPRDC pilot page on the Frictionless Data website.
Next Steps
The validation extension for CKAN currently provides a very basic workflow for validation at creation and update time: if the validation fails in any way, you are not allowed to create or edit the dataset. Maintainers can define a set of default validation options to make this more permissive, but even so, some publishers may not want to enforce every validation check before allowing a dataset to be created, or may want to apply validation only to datasets from a particular organization or of a particular type. Of course the
underlying API is available for extension developers to implement these workflows, but the validation extension itself could provide some of them.
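For instance, an extension developer could hook into CKAN's plugin interfaces and decide when to run validation. The sketch below only validates resources from one hypothetical organization; it relies on CKAN's IResourceController interface and on the extension's assumed resource_validation_run action:

```python
import ckan.plugins as plugins
import ckan.plugins.toolkit as toolkit


class SelectiveValidationPlugin(plugins.SingletonPlugin):
    """Run validation only for resources of a particular organization."""
    plugins.implements(plugins.IResourceController, inherit=True)

    def after_create(self, context, resource):
        dataset = toolkit.get_action('package_show')(
            context, {'id': resource['package_id']})
        org = dataset.get('organization') or {}
        # Hypothetical rule: only validate datasets from 'my-organization'.
        if org.get('name') == 'my-organization':
            toolkit.get_action('resource_validation_run')(
                context, {'resource_id': resource['id']})
```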
The user interface for defining the validation options can definitely be improved, and we are planning to integrate a
Schema Creator to make it easier for publishers to describe their data with a schema based on the actual fields in the file. If a resource has a schema assigned, this information can be presented nicely in the UI to users and exported in different formats.
The validation extension is a first iteration to demonstrate the capabilities of integrating data validation directly into CKAN, but we are keen to hear about different ways in which this could be expanded or integrated into other workflows, so any feedback or thoughts are appreciated.
Additional Resources