In March 2017, the Civic Analytics Network (CAN), a consortium of Chief Data Officers and analytics principals throughout the United States, published the Open Letter to the Open Data Community.
It listed eight guidelines “that, if followed, would advance the capabilities of government data portals across the board and help deliver upon the promise of a transparent government.”
Since then, a lot has transpired in the open data ecosystem. In the past few months alone, CAN released an update to the Open Letter (June 2018, just over a year after the original); the US Government published the initial draft of its Federal Data Strategy (July 2018); the US General Services Administration (GSA) issued an RFQ to build a US government-wide data platform based on CKAN (August 2018); and Google launched its Dataset Search service (Sept 2018).
As the custodian of the CKAN project, the platform behind the largest data portals in the world, the CKAN Association celebrates this emerging consensus of Data as Infrastructure, and recognizes that CKAN, as the leading open source data portal platform, needs to be actively involved in the conversation.
Data is Infrastructure
Data is essential infrastructure of the Information Age. It is as important as our physical infrastructure.
In much the same way that the critical infrastructure of the Technology Ages was accelerated by the adoption of open standards (e.g. the standard gauge, mass production, the shipping container, the internet, the Web, etc.), we believe that our digital infrastructure should be built not just on open standards, but also with preferential treatment for open source.
Beyond the reasons outlined in the paper “Why Open Source Software Matters for Government and Civic Tech - and How to Support It”, doing so affords users the freedom to collaborate and innovate within an ecosystem of providers, entrepreneurs, funders, researchers, practitioners, standards bodies and governments.
Eight Guidelines for Open Data
As the third CKANcon at the International Open Data Conference in Buenos Aires draws near, the CKAN Association is sharing its long-overdue response to the Open Letter:
1. Improve accessibility and usability of data to engage a wider audience.
Thanks to its open source foundation, hundreds of community-developed extensions and widespread use around the world, the CKAN ecosystem has advanced on all the key features and recommendations for this principle.
- Portals should have simple and fast data downloads coupled with more user-friendly design for understanding data.
CKAN now has fast data downloads and far faster uploads that scale, thanks to contributions like the cloudstorage and express loader extensions.
UI/UX has also advanced. On the user interface (UI) side, Boston’s IDEO-based skin is but one of many available themes for others to use. On the user experience (UX) front, innovations like Energinet’s Interactive Filtering Interface and the new Data Dictionary allow even casual users to understand and use the data.
- Accessibility and usability changes should be data-driven; that is, reflective of what datasets and features are being used most.
CKAN’s Google Analytics extension allows data publishers to keep a pulse on how their portal is being used, with the web analytics tool they already use.
Multiple mechanisms to gather feedback from the field are also available. Apart from traditional channels like Twitter, YouTube, Stack Overflow, mailing lists and IRC, there is the Open Knowledge Network, a public repository for ideas and the roadmap, and events like CKANcon that bring the community together to learn, collaborate and build with each other.
- Portals should include more space for conversations on the use of, and resources for, datasets including user guides, dashboards, and social media communication.
There are several options for conversations: Disqus, Discourse, Data Requests, Issues and WordPress Comments, to name a few. The Showcase extension also allows data publishers to show their data in action. With the Pages extension, they can add content beyond data using a simple CMS, with integrations available for full-blown CMSes like Drupal and WordPress. There are even commercial products like OpenGov Stories that integrate with CKAN to publish data-driven narratives.
- Portals should be exemplars of good web design (e.g., many portals are not mobile friendly).
CKAN has been mobile-friendly since v2.3 (Mar 2015), and its pluggable theming system allows CKAN instances to optimize their UI without changing CKAN core code.
Also, open source projects like Canada’s Web Experience Toolkit help CKAN become a true exemplar of web design, making sure it’s not only mobile friendly, but accessible, usable, interoperable and multilingual too.
As is, CKAN supports 79 languages to varying degrees, using the same open source dynamics so that even non-developers can help translate CKAN. And with the fluent extension, field values, not just UI text, can be multilingual as well.
- User research from providers (who is using my data - where are the gaps?), including strategies for marketing open data to populations that might not be using it, would be helpful. This might include additional applications or view options for those communities.
The Global Open Data Index by Open Knowledge International (OKI) helps publishers track their progress and compare themselves with their peers. In the US, OKI collaborated with the Sunlight Foundation earlier this year to relaunch the US City Open Data Census. With it, US cities and open data advocates have objective means to track their progress whilst creating a positive feedback loop with some healthy competition.
Major investments into reusable training resources by major CKAN installations like the European Data Portal help everyone - both publishers and consumers. Extensions like the openapi-viewer from British Columbia help CKAN installations create live, interactive developer documentation using their own data.
Along with expert guidance from trusted institutions like the Open Data Institute, the Government Center of Excellence, Sunlight Open Cities, the Civic Analytics Network, the Governance Lab, and What Works Cities, to name a few, data publishers now have several playbooks and networks to effectively engage communities with their data, and put their data to work.
- Portals should include more intuitive ways of visualizing and exploring data. It’s time for open data to move beyond spreadsheets on the web.
CKAN is a data management platform. It’s not just a data publishing application. It's been used to build a research data hub, a legacy data repository, a Smart City data exchange, even a national energy portal with real-time feeds.
As such, CKAN’s resource view mechanism, along with its plug-in architecture, has allowed the community to create several integrations with best-of-breed tools for visualizing and exploring data - Tableau, Plotly, Carto, OpenGov Operational Performance, and Data.world among others.
Last year, the datatables.net viewer was added alongside the recline.js viewer (the default CKAN “spreadsheet viewer”), in response to usability studies conducted by the City of Boston in collaboration with the Usability Lab at Simmons University.
Advanced CKAN installations like the Humanitarian Data Exchange and the Western Pennsylvania Regional Data Center (WPRDC) continue to push the envelope by building (and open-sourcing) compelling data exploration and visualization tools beyond spreadsheets on the web.
- In addition to view totals for datasets, cities’ data portals should also publish information on which datasets are being downloaded.
In addition to CKAN’s built-in stats, the Google Analytics extension provides a way to track which datasets are being downloaded, and to show that information to general users as well, not just in reports that are only available to administrators.
WPRDC also built and open-sourced the WPRDC Dashboard - a custom analytics dashboard that has even finer-grained analytics tailored to their use cases that they can share with a wider audience. It’s an instrumental tool WPRDC uses to provide feedback to its data stewards and publishers about how many people benefit from their efforts to publish open data.
2. Move away from a single dataset centric view.
This is already the default behaviour of CKAN, with its notion of a Dataset Package with multiple Resources.
And with the imminent release of Resource Queries in CKAN v3.0, data publishers can create complex SQL queries to create derived datasets. For example, from a 311 master file, a publisher can automatically create different subsets of the data (e.g. by neighborhood; by year; join 311 graffiti/light outage with crime data, etc.), without having to upload and maintain these derivative datasets separately.
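The idea of deriving subsets from one master file with SQL can be sketched with Python's built-in `sqlite3` module. This is only an illustration of the concept, not the Resource Queries feature itself: the table name, columns, view names and rows below are all hypothetical, and SQLite views stand in for whatever mechanism CKAN ends up using.

```python
import sqlite3

# Hypothetical 311 "master file" rows: (request_id, neighborhood, year, category).
ROWS = [
    (1, "Back Bay", 2017, "graffiti"),
    (2, "Back Bay", 2018, "light outage"),
    (3, "Roxbury", 2018, "graffiti"),
    (4, "Roxbury", 2018, "pothole"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests_311 "
    "(request_id INTEGER, neighborhood TEXT, year INTEGER, category TEXT)"
)
conn.executemany("INSERT INTO requests_311 VALUES (?, ?, ?, ?)", ROWS)

# Each view is a derived dataset: defined once as a query, never uploaded separately.
conn.execute(
    "CREATE VIEW requests_2018 AS SELECT * FROM requests_311 WHERE year = 2018"
)
conn.execute(
    "CREATE VIEW graffiti_by_neighborhood AS "
    "SELECT neighborhood, COUNT(*) AS total FROM requests_311 "
    "WHERE category = 'graffiti' GROUP BY neighborhood"
)


def fetch(view_name):
    """Return all rows of a derived view, ordered by its first column."""
    return conn.execute(f"SELECT * FROM {view_name} ORDER BY 1").fetchall()


# A new row in the master table shows up in every derived view automatically.
conn.execute("INSERT INTO requests_311 VALUES (5, 'Back Bay', 2018, 'graffiti')")
```

Calling `fetch("graffiti_by_neighborhood")` after the final insert reflects the new row without any re-upload, which is the maintenance saving described above.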
Frictionless Data, another OKI project, also introduces the notion of Data Packages (think “Docker for Data”), which CKAN now also supports with the Data Curator editor and the datapackager and validation extensions.
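For readers unfamiliar with Data Packages, a descriptor is a small JSON document (conventionally `datapackage.json`) that travels with the data files and describes their schema. The sketch below builds a minimal one in Python. The helper name, dataset name, file path and column names are hypothetical; only the descriptor keys (`name`, `resources`, `path`, `schema`, `fields`, `type`) come from the Frictionless specifications.

```python
import json


def make_data_package(name, csv_path, fields):
    """Build a minimal Data Package descriptor for a single CSV resource.

    `fields` is a list of (column_name, column_type) pairs.
    """
    return {
        "name": name,
        "resources": [
            {
                "name": name + "-data",
                "path": csv_path,
                "format": "csv",
                "schema": {
                    "fields": [{"name": n, "type": t} for n, t in fields]
                },
            }
        ],
    }


# Hypothetical dataset: a 311 service request extract.
descriptor = make_data_package(
    "service-requests-311",
    "data/requests.csv",
    [("request_id", "integer"), ("opened", "date"), ("category", "string")],
)

# This string is what would be saved next to the CSV as datapackage.json.
datapackage_json = json.dumps(descriptor, indent=2)
```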
3. Treat geospatial data as a first-class data type.
Various experiments are being done in this space by several CKAN installations. Of note are the DataSpatial extension from the UK Natural History Museum, which supports advanced spatial search using PostGIS and Solr; TerriaJS from Australia’s research agency, the Commonwealth Scientific and Industrial Research Organisation (CSIRO); and the EU’s Next Generation Global Earth Observation System of Systems (NextGEOSS) Data Hub.
Other geospatial experiments include configurable, batched geocoding; integration with ESRI beyond harvesting; and native integration with Chicago’s OpenGrid.
Beyond these, the CKAN v3 roadmap that will be released at CKANcon@IODC18 directly addresses making geospatial data a first-class data type.
4. Improve management and usability of metadata.
The Data Dictionary feature introduced in v2.7 goes a long way towards automating the creation of high-quality metadata. It is an integral part of the upload workflow, and it goes beyond documenting the dataset schema to double as an administrative interface for managing the column data types of the underlying dataset. In both respects, it directly improves the management and usability of metadata.
The integration of the scheming extension in CKAN core for v3 should also make the creation and sharing of custom metadata schemas much easier.
Working closely with organizations that consume metadata has also paid dividends.
Google’s involvement in adding schema.org JSON-LD support to the DCAT extension, for instance, meant that CKAN supported Google Dataset Search on day one, when it was announced earlier this month.
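To make the schema.org JSON-LD idea concrete, here is a rough sketch of the kind of markup a dataset page can embed so that crawlers can read its metadata. It is not the DCAT extension's actual output: the function names, the example portal URL and the dataset values are hypothetical, and only the schema.org vocabulary terms (`Dataset`, `DataDownload`, `distribution`, `contentUrl`, `encodingFormat`) are standard.

```python
import json


def dataset_jsonld(title, description, page_url, download_url, media_type):
    """Describe one dataset and one downloadable file using schema.org terms."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": title,
        "description": description,
        "url": page_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": download_url,
                "encodingFormat": media_type,
            }
        ],
    }


def as_script_tag(doc):
    """Wrap the JSON-LD document in the script tag a dataset page would embed."""
    # Escape "</" so a stray "</script>" inside a value cannot close the tag early.
    body = json.dumps(doc).replace("</", "<\\/")
    return '<script type="application/ld+json">' + body + "</script>"


# Hypothetical dataset on a hypothetical portal.
markup = as_script_tag(
    dataset_jsonld(
        "311 Service Requests",
        "All service requests received by the city.",
        "https://data.example.org/dataset/311-service-requests",
        "https://data.example.org/dataset/311-service-requests/requests.csv",
        "text/csv",
    )
)
```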
5. Decrease the cost and work required to publish data.
There are multiple mechanisms to automate data publishing in CKAN. There are free connectors for enterprise-class Extract, Transform & Load (ETL) tools like Pentaho Kettle and Safe Software’s FME. There are also various API clients that can be integrated into just about any ETL automation pipeline, for Python, JavaScript, Ruby, PHP, Java, Perl, R and the command line. There’s even a CKAN OpenRefine connector. And with the aptly named Express Loader, CKAN can now reliably load large datasets far faster.
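As an illustration of how a publishing step can be scripted, the sketch below prepares a call to the `package_create` action of CKAN's Action API using only the Python standard library. The portal URL, API key, organization and dataset values are placeholders, and the helper name is hypothetical. The request is built but deliberately not sent, so the example runs offline.

```python
import json
import urllib.request


def build_package_create_request(portal_url, api_key, name, title, owner_org):
    """Prepare (but do not send) a request that would create a dataset."""
    payload = {"name": name, "title": title, "owner_org": owner_org}
    return urllib.request.Request(
        url=portal_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": api_key,  # placeholder key, never hard-code a real one
        },
        method="POST",
    )


# Placeholder portal, key and dataset values, for illustration only.
request = build_package_create_request(
    "https://data.example.org/",
    "hypothetical-api-key",
    "service-requests-311",
    "311 Service Requests",
    "public-works",
)

# In a real pipeline, urllib.request.urlopen(request) would submit the request
# and the JSON response would report success or a validation error.
```

In practice, a dedicated client such as one of the libraries listed above would usually be preferable to hand-built requests, since it handles errors and file uploads.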
6. Introduce revision history.
CKAN’s revision history is currently being revamped with an expanded Activity Streams feature that’s slated for release in CKAN v3.0. It behaves more like an audit trail, allowing users to see not just old versions of datasets, but also the differences in metadata values between revisions.
For a preview, check this presentation by David Read at CKANconUS.
7. Improve management of large datasets.
CKAN can now reliably work with large datasets. Apart from Express Loader, contributions like the Apache Libcloud-powered cloudstorage extension now allow CKAN to work with 50+ cloud providers to upload datasets up to 5 TB in size!
And since the datasets are stored in the cloud provider’s infrastructure, we gain the added benefit of high availability and access to additional data processing capabilities offered by these providers (e.g. when using AWS, take advantage of services like Lambda to do serverless computing, or Athena for ad-hoc queries; when using Azure, take advantage of Azure storage filters to power performant interactive filters, etc.).
Also, a frequent criticism of CKAN is that it is not multi-tenant. In our opinion, this is actually an advantage in this age of containerization.
First-generation Software-as-a-Service (SaaS) solutions were often slow because their shared-everything multi-tenant architecture meant that high utilization by one tenant (say, because of the release of a popular dataset, or the running of a hackathon) would impact the performance of another tenant. Containerization also has the added benefit of true process, data and security isolation, which matters because several installations use CKAN as internal data-sharing infrastructure (“open data inside”).
8. Set clear transparent pricing based on memory, not number of datasets.
The open source status of CKAN allows a variety of service models to emerge, from SaaS to professional services models that support on-premises installations with white-glove support. That said, none of the existing CKAN-as-a-Service providers employ the “nickel-and-diming” per-dataset pricing model, as we want to incentivize, not penalize, use.
This extends to per-seat licensing as well: there is none. Further, each CKAN instance has full access to its users’ profiles and the ability to craft its own terms of service. Contrast this with some proprietary data portal providers’ terms of service, which assert a very one-sided grant of rights to the provider. It is your data, and they are your users, not the portal provider’s.
Furthermore, your budget directly funds a lot of contributions to the CKAN ecosystem, as all the existing CKAN providers contribute to and sponsor core development: new features and innovations largely informed by the work they do with their customers.
And since it’s open source, there’s no lock-in. If pricing ever becomes an issue, publishers can export all their data and move to another CKAN provider, or their own CKAN instance, or adopt another solution entirely.
Conclusion
In its long history, the CKAN project has always been at the heart of the Open Data movement, driven by a wide community of users, developers and stakeholders. As the movement evolves and faces new challenges, we are keen on CKAN responding to these changes in order to keep serving the community.
In that regard, CAN's Open Letter provides invaluable guidelines. We look forward to engaging CAN and other stakeholders, so that together we can build the Data Infrastructure that will deliver the transparent, data-driven, 21st-century government we expect and deserve.