Our Response to the Open Letter to the Open Data Community

  • the CKAN Association
  • 20 Sep 2018

In March 2017, the Civic Analytics Network (CAN), a consortium of Chief Data Officers and analytics principals throughout the United States, published the Open Letter to the Open Data Community. It listed eight guidelines “that, if followed, would advance the capabilities of government data portals across the board and help deliver upon the promise of a transparent government.” Since then, a lot has transpired in the open data ecosystem. In the past few months alone, CAN released an update to the Open Letter a year later (June 2018); the US Government published the initial draft of its Federal Data Strategy (July 2018); the US General Services Administration (GSA) issued an RFQ to build a US government-wide data platform based on CKAN (August 2018); and Google launched its Dataset Search service (Sept 2018). As the custodian of the CKAN project, the platform behind the largest data portals in the world, the CKAN Association celebrates this emerging consensus of Data as Infrastructure, and recognizes that, as the steward of the leading open source data portal platform, it needs to get actively involved in the conversation.

Data is Infrastructure

Data is essential infrastructure of the Information Age. It is as important as our physical infrastructure. In much the same way that the critical infrastructure of the Technology Ages was accelerated by the adoption of open standards (e.g. the standard gauge, mass production, the shipping container, the internet, the Web, etc.), we believe that our digital infrastructure should be built not just on open standards, but also with preferential treatment for open source. Beyond the reasons outlined in the paper “Why Open Source Software Matters for Government and Civic Tech - and How to Support It”, doing so affords users the freedom to collaborate and innovate within an ecosystem of providers, entrepreneurs, funders, researchers, practitioners, standards bodies and governments.

Eight Guidelines for Open Data

As the third CKANcon at the International Open Data Conference in Buenos Aires draws near, the CKAN Association is sharing our long-overdue response to the Open Letter:

1. Improve accessibility and usability of data to engage a wider audience.

Thanks to its open source foundation, hundreds of community-developed extensions and widespread use around the world, the CKAN ecosystem has advanced on all the key features and recommendations for this principle.

2. Move away from a single dataset centric view.

This is already the default behaviour of CKAN, with its notion of a Dataset Package with multiple Resources. And with the imminent release of Resource Queries in CKAN v3.0, data publishers can write complex SQL queries that produce derived datasets. For example, from a 311 master file, a publisher can automatically create different subsets of the data (e.g. by neighborhood; by year; join 311 graffiti/light outage with crime data, etc.), without having to upload and maintain these derivative datasets separately. Frictionless Data, another OKI project, also introduces the notion of Data Packages (think “Docker for Data”), which CKAN now also supports with the Data Curator editor, and the datapackager and validation extensions.
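The idea of deriving subsets from a single master file can be sketched with plain SQL views. This is only an illustration using Python's built-in sqlite3 module, not the Resource Queries feature itself; the table name, column names and sample rows are all hypothetical.

```python
import sqlite3

# In-memory database standing in for a 311 master file (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests_311 ("
    "id INTEGER PRIMARY KEY, category TEXT, neighborhood TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO requests_311 VALUES (?, ?, ?, ?)",
    [
        (1, "graffiti", "Downtown", 2017),
        (2, "light outage", "Downtown", 2018),
        (3, "graffiti", "Riverside", 2018),
        (4, "pothole", "Riverside", 2018),
    ],
)

# Derived "datasets" are defined once as queries; nothing is copied or re-uploaded.
conn.execute(
    "CREATE VIEW requests_2018 AS SELECT * FROM requests_311 WHERE year = 2018"
)
conn.execute(
    "CREATE VIEW graffiti_by_neighborhood AS "
    "SELECT neighborhood, COUNT(*) AS n FROM requests_311 "
    "WHERE category = 'graffiti' GROUP BY neighborhood"
)

ids_2018 = conn.execute("SELECT id FROM requests_2018 ORDER BY id").fetchall()

# A new row in the master table shows up in every derived view automatically.
conn.execute("INSERT INTO requests_311 VALUES (5, 'graffiti', 'Downtown', 2018)")
graffiti_counts = conn.execute(
    "SELECT neighborhood, n FROM graffiti_by_neighborhood ORDER BY neighborhood"
).fetchall()
```

Because the subsets are queries rather than copies, only the master table needs to be kept up to date.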

3. Treat geospatial data as a first-class data type.

Various experiments are being done in this space by several CKAN installations. Of note are the DataSpatial extension from the UK Natural History Museum, which supports advanced spatial search using PostGIS and Solr; TerriaJS from Australia’s research agency, the Commonwealth Scientific and Industrial Research Organisation (CSIRO); and the EU’s Next Generation Global Earth Observation System of Systems (NextGEOSS) Data Hub. Other geospatial experiments include configurable batched geocoding; integration with ESRI beyond harvesting; and native integration with Chicago’s OpenGrid. Beyond these, the CKAN v3 roadmap that will be released at CKANcon@IODC18 directly addresses making geospatial data a first-class data type.

4. Improve management and usability of metadata.

The Data Dictionary feature introduced in v2.7 goes a long way towards automating the creation of high-quality metadata. By making it an integral part of the upload workflow, and taking it beyond just documenting the dataset schema to double as an administrative interface to manage the column data types of the underlying dataset, this feature directly improves the management and usability of metadata. The integration of the scheming extension in CKAN core for v3 should also make the creation and sharing of custom metadata schemas much easier. Working closely with organizations that consume metadata has also paid dividends. Google’s involvement in adding schema.org JSON-LD support to the DCAT extension, for instance, meant that CKAN supported Google Dataset Search on day one, when the service was announced earlier this month.
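For readers unfamiliar with schema.org JSON-LD, here is a minimal sketch of what dataset markup of this kind looks like. It is not the output of the DCAT extension; the helper name `dataset_to_jsonld`, the sample dataset and the example.org URLs are hypothetical, while the `Dataset` and `DataDownload` types and their properties come from the schema.org vocabulary.

```python
import json


def dataset_to_jsonld(name, description, url, distributions):
    """Build a schema.org Dataset description as a plain dict (hypothetical helper)."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        # Each downloadable file is described as a DataDownload.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": fmt,
                "contentUrl": content_url,
            }
            for fmt, content_url in distributions
        ],
    }


markup = dataset_to_jsonld(
    name="311 Service Requests",
    description="Service requests reported by residents (sample record).",
    url="https://data.example.org/dataset/311-service-requests",
    distributions=[("text/csv", "https://data.example.org/files/311.csv")],
)

# The serialized form is what would be embedded in a page inside a
# <script type="application/ld+json"> element for crawlers to read.
jsonld_text = json.dumps(markup, indent=2)
```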

5. Decrease the cost and work required to publish data.

There are multiple mechanisms to automate data publishing in CKAN. There are free connectors for enterprise-class Extract, Transform & Load (ETL) tools like Pentaho Kettle and Safe FME. There are also various API clients that can be integrated into just about any ETL automation pipeline, for Python, JavaScript, Ruby, PHP, Java, Perl, R and the command-line. There’s even a CKAN OpenRefine connector. And with the aptly-named Express Loader, CKAN can now reliably load large datasets dramatically faster.
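As a rough sketch of what such automation can look like, the snippet below prepares a call to CKAN's Action API `package_create` endpoint using only the Python standard library. The site URL, API key, organization and dataset fields are hypothetical placeholders, and the request is only built here, not sent; a real pipeline would more likely use one of the API clients mentioned above.

```python
import json
import urllib.request


def build_package_create_request(site_url, api_key, dataset):
    """Prepare (but do not send) a request that would create a dataset."""
    url = site_url.rstrip("/") + "/api/3/action/package_create"
    body = json.dumps(dataset).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": api_key,  # the publisher's API key
        },
        method="POST",
    )


# All values below are hypothetical placeholders.
request = build_package_create_request(
    "https://data.example.org/",
    "hypothetical-api-key",
    {
        "name": "311-service-requests",
        "title": "311 Service Requests",
        "owner_org": "city-hall",
    },
)

# An ETL job would then send it, e.g. with urllib.request.urlopen(request).
```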

6. Introduce revision history.

CKAN’s revision history is currently being revamped with an expanded Activity Streams feature that’s slated for release in CKAN v3.0. It behaves more like an audit trail, allowing users to see not just old versions of datasets, but also the differences in metadata values between revisions. For a preview, check this presentation by David Read at CKANconUS.
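To make the notion of differences in metadata values between revisions concrete, here is a small sketch that compares two metadata snapshots field by field. It is not how Activity Streams is implemented; the function name `diff_metadata` and the sample revisions are hypothetical.

```python
def diff_metadata(old, new):
    """Return {field: (before, after)} for every field whose value changed.

    A field missing from one revision is reported with None on that side.
    """
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Two hypothetical revisions of the same dataset's metadata.
revision_1 = {"title": "311 Requests", "license_id": "cc-by", "notes": "Raw export"}
revision_2 = {
    "title": "311 Service Requests",
    "license_id": "cc-by",
    "notes": "Raw export",
    "version": "2",
}

changes = diff_metadata(revision_1, revision_2)
```

Here `changes` records the retitled dataset and the newly added `version` field, while unchanged fields such as `license_id` are left out.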

7. Improve management of large datasets.

CKAN can now reliably work with large datasets. Apart from Express Loader, contributions like the Apache Libcloud powered cloudstorage extension now allow CKAN to work with 50+ cloud providers to upload datasets up to 5 TB in size. And since the datasets are stored in the cloud provider’s infrastructure, we gain the added benefit of high availability and access to additional data processing capabilities offered by these providers (e.g. when using AWS, take advantage of services like Lambda to do serverless computing, or Athena for ad-hoc queries; when using Azure, take advantage of Azure storage filters to power performant interactive filters, etc.).

A frequent criticism of CKAN is that it is not multi-tenant. In our opinion, this is actually an advantage in this age of containerization, where each portal can run as its own isolated instance. First-generation Software-as-a-Service (SaaS) solutions were often slow because their shared-everything multi-tenant architecture meant that high utilization by one tenant (say, because of the release of a popular dataset, or the running of a hackathon) would impact the performance of another tenant. Containerization also has the added benefit of true process, data and security isolation, which matters because several CKAN installations use CKAN as internal data-sharing infrastructure (“open data inside”).

8. Set clear transparent pricing based on memory, not number of datasets.

The open source status of CKAN allows for a variety of service models to emerge, from SaaS to professional services models that support on-premise installations with white-glove support. That said, none of the existing CKAN-as-a-Service providers employ the “nickel-and-diming” per-dataset pricing model, as we want to incentivize, not penalize, use. This extends to per-seat licensing as well: there is none. Further, each CKAN instance has full access to its users’ profiles and has the ability to craft its own terms of service. Contrast this with some proprietary data portal providers’ terms of service, which assert a very one-sided grant of rights to the proprietary portal provider. It is your data, and they are your users, not the portal provider’s. Furthermore, your budget directly funds a lot of contributions to the CKAN ecosystem, as all the existing CKAN providers contribute to and sponsor core development, with new features and innovations largely informed by the work they do with their customers. And since it’s open source, there’s no lock-in. If pricing ever becomes an issue, publishers can export all their data and move to another CKAN provider, or their own CKAN instance, or adopt another solution entirely.

Conclusion

In its long history, the CKAN project has always been at the heart of the Open Data movement, driven by a wide community of users, developers and stakeholders. As the movement evolves and faces new challenges, we are keen on CKAN responding to these changes in order to keep serving the community. In that regard, CAN's Open Letter provides invaluable guidelines. We look forward to engaging CAN and other stakeholders - so we can build together the Data Infrastructure that will deliver the transparent, data-driven, 21st century government we expect and deserve.