Machine learning and council information: how to label council data with machine-learning techniques

More and more municipalities publish their council information, such as minutes and council proposals, as open data. Although this is a positive development, more work is needed to make these documents accessible to journalists and other stakeholders. Council data is especially difficult to work with because it lacks structure. The documents, whether proposals or other records, often do not have labels that specify their theme. Anyone familiar with the field knows that creating a data standard for at least 380 municipalities is no easy task. The question is therefore whether machine-learning techniques could lend a helping hand in bringing order to the chaos.

Axel Hirschel, a data science student at the University of Amsterdam, wrote his master's thesis on this topic at the Open State Foundation. His research focused on whether documents can be labeled automatically. This would benefit stakeholders, as proper labels would make the data easier to search.

In his Dutch Open State Foundation blog post, Axel describes the key pain points of his research. A main difficulty was finding a data set to train the algorithms on: no pre-labeled municipal data set was available. Instead, Axel had to use a labeled data set from the Dutch parliament. One difficulty with this workaround was terminology. Municipalities do not use the same phrases as the parliament to describe problems, which makes it harder for a model trained on parliamentary documents to label council documents.
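To make the terminology problem concrete, here is a minimal sketch of a text labeler trained on one source and applied to another. It is not Axel's method: it uses a plain Naive Bayes classifier written from scratch, and the example texts, the labels and the `NaiveBayesLabeler` class are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real Dutch text would need more care."""
    return text.lower().split()


class NaiveBayesLabeler:
    """Hypothetical toy labeler: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen during training carry no signal, so they are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            n_words = sum(self.word_counts[label].values())
            score = math.log(self.label_counts[label] / total_docs)
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (n_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def unknown_share(self, text):
        """Fraction of words in `text` that the training data never contained."""
        tokens = tokenize(text)
        if not tokens:
            return 0.0
        return sum(t not in self.vocab for t in tokens) / len(tokens)


# Invented stand-ins for labeled parliamentary documents.
parliament_texts = [
    "national housing policy and rent regulation",
    "rent subsidy for social housing tenants",
    "funding for primary schools and teachers",
    "teachers salary and school curriculum reform",
]
parliament_labels = ["housing", "housing", "education", "education"]

model = NaiveBayesLabeler().fit(parliament_texts, parliament_labels)

# Invented stand-in for an unlabeled municipal document.
municipal_text = "zoning plan for new rent controlled housing on the market square"
print(model.predict(municipal_text))
print(round(model.unknown_share(municipal_text), 2))
```

In this toy case the label still comes out right, but most of the words in the municipal text are unknown to the model, so the prediction rests on only a few shared terms. With real documents, that gap in vocabulary is the kind of mismatch the paragraph above describes.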

In his thesis, Axel elaborates on possible solutions for automatically labeling municipal documents. One option, which Axel looks forward to examining in the future, is to create a pre-labeled data set that could then be used to train his algorithms.

Interested in reading more about his research? Check out his more elaborate Dutch article here and his thesis here. If you have any specific questions, do not hesitate to contact him through LinkedIn.