Open Data: stop making a mess of it

All across the world, governments are releasing more and more open data. Yet developers still end up frustrated when they try to put that data to proper use.

Open data is the open release of information, specifically for computer programs, without any restrictions on re-use. It sounds almost too simple, yet governments often mess it up.

By Lex Slaghuis, CTO Open State Foundation

Without any restrictions?

Releasing open data confronts governments with the fact that ‘third parties’ can suddenly do A-N-Y-T-H-I-N-G with their ‘data’. A recipe for disaster? Surely ‘third parties’ couldn’t be properly equipped to handle T-H-E-I-R delicate data? Before you know it, everybody else starts making a mess of it. Well, Joy’s Law states: “No matter who you are, most of the smartest people work for someone else.” Unfortunately, governments do not refrain from putting restrictions on use by means of a license or contract. This seriously hinders the re-use of open data, for three reasons:

1. Less re-use

First of all, specific use cases of open data are forbidden by license or contract. For instance, the contract terms of the Dutch national railway service state: ‘User will take care that he/she and third parties will only use the API directly or indirectly for informing travellers about future trips with public transportation.’

I am not a legal expert, but it seems to me that using public transportation data for calculating average delays is not allowed. These are the typical terms that eliminate the true benefit of open data: the unlimited variety of applications that are possible.

2. No combined re-use

A license or contract often has viral side effects that prevent combining and filtering multiple datasets. When combining data, the question is whether the licenses conflict legally and what the minimum allowed legal framework would be. Because this is really difficult to tell, the data cannot be put to proper use.

3. No international re-use

Innovation and technology are also a game that is best played globally, not limited by national and language borders.
How do I know if I can use a certain dataset from the Chinese government if I can’t read (legal) Chinese? Excessive use of local licensing limits international re-use: even if I could legally re-use the data, how would I find out? This is frustrating for companies like Factual.

Open data without rights

In the Netherlands it is an open question whether such closed or local licenses would actually hold up in (Dutch) court given the FOR laws. But if data-releasing parties think their terms have been violated, it certainly doesn’t improve your relationship with them. This is a real problem, because as a developer I am not only dependent on the data available right now, I also want to have the data in the future. Unfortunately, there still is no fundamental human right to open data.

There is only one proper way for governments to release open data: free of licenses and other restrictions. Creative Commons created the Zero mark (CC0) exactly for this. So let’s use it!

No open release

Governments also introduce other, often accidental, barriers to the re-use of open data. Some ask ‘re-users’ to identify themselves before they can get access to the data. Others are curious who is using it, want to monitor access to the data, or want to manage the IT capacity for ‘open data’. Remarkably, I have never heard NASA, provider of one of the biggest and most used datasets in the world, complain about IT capacity. Open data is only truly open if I (and anything else on the web) can get it by using just one web link.

4. Identification is an unnecessary limitation of privacy

These limitations raise a fundamental question as well. Why shouldn’t I be able to use government data anonymously? Surely I don’t need to identify myself when visiting a city website? It’s none of a ‘regular’ government’s business to know what I am doing. So why should the government collect personal information on the re-use of open data?

5. Identification doesn’t scale

Oh well, as an open data developer you won’t get far if you have principles. But putting open data behind user credentials and keys also doesn’t scale. Registering for a credential is of course not a problem, until you have to do it at 400 city websites. A complete waste of time! Building a global app with this data becomes a ludicrous undertaking.

6. Identification and licenses hinder distribution

But hey, as a nerd, I could identify myself and get this data for my country. I could share it with colleagues across the globe in a shared database (GitHub), if it wasn’t for the licenses that prohibit further distribution. So even if you want to change the world, it is still not happening. The only solution is to release these data openly and leave part of the distribution to others, who don’t make a mess of it.

Not for computers

Nowadays most public servants understand that hiding Excel data in PDF files is really not open data. But using proper file formats would still change the world. An example is CIBG (a Dutch health information organisation), which releases open data only in Excel and SPSS files.

As a rule of thumb, it is far easier to write a programme that reads open file formats than one that has to pry open closed file formats.
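To illustrate the rule of thumb, here is a minimal sketch in Python of reading an open format (CSV) with nothing but the standard library. The column names and sample rows are invented for illustration; they do not come from any CIBG dataset.

```python
import csv
import io

# Hypothetical sample data; the columns and values are invented for illustration.
SAMPLE_CSV = """municipality,year,registered_providers
Amsterdam,2013,120
Utrecht,2013,45
"""

def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_rows(SAMPLE_CSV)

# Every value arrives as plain text, so convert before doing arithmetic.
total = sum(int(row["registered_providers"]) for row in rows)
print(total)
```

No vendor software, no licence key and no reverse engineering is needed: any language with a CSV parser can do the same.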

7. Computer programs suck at bureaucracy

It gets even worse when a registration or the creation of an access key is required. It may come as a surprise, but computers are really bad at correctly filling in paper or online forms with personal information. So forget about applications that crawl the web searching for interesting data, automatically collect it and put it to re-use. Sounds like magic? The semantic web (1996) by Tim Berners-Lee is built on the principle that data is everywhere, and nowhere, at the same time.

8. APIs restrict access to data

In the old days a lot of government data was extracted from websites by web crawlers, but nowadays you still need hacker skills. Why? Well, the government takes its cue from the corporate world, where web platforms like Flickr, Amazon and Facebook lead the way with sexy APIs.

An Application Programming Interface (API) is an access point to information that has been designed as a service. But our government doesn’t work the same way as the corporate world. Companies supply services and can design an entrance for computer programs so that applications can be built upon them.

The government is not a company

For our governments the situation is the other way around. Our governments need to serve the public and society as a whole. That means that cases of re-use with civic value, which are not solely the extension of a public service, are also worthwhile. Government data should not introduce technical difficulties for re-use. This is directly opposite to the corporate world, where people want to, and should, align re-use as much as possible with company interests.

One could wonder: why is a sexy API a form of restriction? APIs provide an entrance to information. As with every entrance, choices are made during construction. Should it be a revolving door or a sliding door, and what should the width and height of the door be? These design choices dictate re-use and determine the convenience with which open data can be acquired and put to use.

Flickr as a company with an API

You can ask the Flickr photo database for a picture and for all its metadata. What you cannot do is ask for the names of all the files in the Flickr database. From Flickr’s point of view that is an understandable choice: their competitors would copy their database in a heartbeat! But the government should not have any reason to do this. The moment governments design APIs, they are guilty of making design choices that influence re-use.

9. APIs introduce unnatural scarcity

The Dutch national road organisation (RDW) has an Azure API for its vehicle database. This RDW ‘service’ enables people to see which cars have been removed from the road. To identify exported (i.e. deleted) classic cars, however, you have to compare the entire database with a previous version. It is not possible to download all vehicles at once: through the API the process is divided into small batches of, for example, 200,000 records. The result: communication between apps and the database that takes hours and hours to update nine million vehicles. This is where IT capacity problems arise, and consequently they (now rightly) want me to identify myself (see arguments 4, 5 and 6).
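As a rough sketch of what batching means for a re-user, the Python below walks through a database of nine million records in batches of 200,000, the figures mentioned above. The `fetch_batch` function is a hypothetical stand-in that fabricates record numbers; it is not the RDW API, and no real paging parameters are assumed.

```python
TOTAL_VEHICLES = 9_000_000   # "nine million vehicles", from the text
BATCH_SIZE = 200_000         # example batch size from the text

def fetch_batch(offset, limit, total=TOTAL_VEHICLES):
    """Hypothetical stand-in for one API call; returns record numbers, not vehicles."""
    return range(offset, min(offset + limit, total))

def download_all(batch_size=BATCH_SIZE, total=TOTAL_VEHICLES):
    """Walk through the whole database batch by batch and count the calls needed."""
    calls = 0
    records = 0
    offset = 0
    while offset < total:
        batch = fetch_batch(offset, batch_size, total)
        records += len(batch)
        offset += batch_size
        calls += 1
    return calls, records

calls, records = download_all()
print(calls, records)
```

Each of those calls is a separate round trip that has to succeed, in order, before a full comparison with the previous version can even begin, and the whole sequence has to be repeated for every update.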

Aaarghhh! The open alternative would be to download a compressed copy of the entire database in a few minutes. Moreover, this copy sits on your own computer, which allows you to continue your work comfortably while the government is shutting down. So, open data: let’s stop making a mess of it.
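With two full copies on your own machine, finding the vehicles that disappeared is a simple set difference. A minimal sketch in Python, assuming each snapshot boils down to a list of licence plates; the plates below are invented for illustration.

```python
def removed_plates(previous, current):
    """Return plates present in the previous snapshot but missing from the current one."""
    return sorted(set(previous) - set(current))

# Hypothetical licence plates, invented for illustration.
last_month = ["AB-12-CD", "EF-34-GH", "IJ-56-KL"]
this_month = ["AB-12-CD", "IJ-56-KL", "MN-78-OP"]

print(removed_plates(last_month, this_month))
```

The comparison runs locally, so it needs no credentials and puts no load on the data provider.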

This blogpost by Lex Slaghuis was published earlier on