The British government has taken the astounding decision to start using its own data. We investigate why it took so long, and required the assistance of a leading AI expert and the inventor of the World Wide Web

At the start of November, in a speech at the Open Data Institute Summit 2015, Matt Hancock, Minister for the Cabinet Office and Paymaster General, announced that the British government would not only be releasing its own datasets for use by companies and the public, but also be making use of this data for its own operations.

Presented as an innovative approach to data, the idea was characterised by Hancock as “dogfooding” – a reference to a pet food CEO who famously ate his own company’s product. The Open Data Institute (ODI), which helped develop the idea, was similarly positive.


“There’s a huge opportunity, we believe, not really fully implemented at all, for the government to consume its own open data,” said Nigel Shadbolt, chairman and co-founder of the ODI, in a press Q&A later that day. “It’s a really quite odd feature of this landscape that you’re publishing it, so what about consuming it? And using it to really increase public sector efficiencies; to improve public service delivery; to reduce the friction of information transfer between departments. I think that’s an area where a little investment there could really make for some dramatic returns.”

The idea is in itself not a bad one, but it is frankly baffling that using your own data, which in some cases has been amassing for decades, counts as innovative. Surely something as basic as this should have been obvious to the UK government?

Write-only data

Save for the occasional check, much of the data previously collected by governments has effectively been write-only; it’s collected, but is mostly left to sit in filing cabinets locked away in head offices.

“Government data is a record of what’s happened, of what is. From ancient times it’s often recorded how much you’ve spent and on what,” said Hancock. “But while the data may have stayed the same, what it’s recorded on has changed. We’ve evolved from clay, to parchment, to paper and printing press, and now pixel.

“We’re in the foothills of a data revolution,” he added. “Data is no longer just a record. Over the last few years it’s acquired a very new set of characteristics.”

Describing data as “the unseen infrastructure of the digital economy”, Hancock said that it was both a mineable commodity and a form of property.

However, these features are also present in traditional forms of data; what has changed is the sheer quantity now available, and the ability of software to extrapolate from it, which together have made using data both so achievable and so valuable.

Infrastructure challenge


The answer to why using data is such a significant step is that it requires suitable infrastructure to support it, and while many businesses and individuals use software to analyse their own data, transitioning a hulking behemoth like an entire government takes time.

During the Second World War, Bletchley Park, home of the codebreakers who cracked the Enigma cipher, was an infrastructure hub for analysing data, most of it powered by brains rather than machines. But in peacetime, government spending has never been high enough to radically overhaul how data is handled, leaving different departments’ datasets totally unconnected and vital information out of reach.

Now if the UK government is to achieve its plans, it is going to need to spend money on its own digital infrastructure, something that may be easier said than done.

“First we have to modernise our data infrastructure. We need to get better at standardising and maintaining our data,” said Hancock. “We need to move away from governments’ reliance on bulk data sharing and create an economy of APIs. And as with every other aspect of government, we need data services built around the needs of users, not the internal logic of Whitehall.”
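Hancock’s “economy of APIs” can be contrasted with bulk data sharing in a minimal sketch. Everything below is hypothetical for illustration: the dataset, field names and functions are invented, not any real government schema or service.

```python
# Illustrative sketch only: contrasting bulk data sharing with API-style
# access. All dataset names and fields here are hypothetical.

# A toy "departmental dataset" that would traditionally be shared in bulk.
SPEND_RECORDS = [
    {"department": "Health", "year": 2014, "amount_gbp": 1200},
    {"department": "Health", "year": 2015, "amount_gbp": 1350},
    {"department": "Transport", "year": 2015, "amount_gbp": 900},
]

def bulk_export():
    """Bulk sharing: the consumer receives everything, relevant or not."""
    return list(SPEND_RECORDS)

def query_api(department=None, year=None):
    """API-style access: the consumer asks only for the slice it needs."""
    results = SPEND_RECORDS
    if department is not None:
        results = [r for r in results if r["department"] == department]
    if year is not None:
        results = [r for r in results if r["year"] == year]
    return results

# A consumer interested in 2015 health spending fetches one record,
# not the whole dataset.
print(len(bulk_export()))                        # 3
print(query_api(department="Health", year=2015))
```

The design point is that an API puts the filtering on the publisher’s side, so consumers never need a full copy of a dataset that may change underneath them.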

However, the government has given no clear timeframe for how it plans to implement the system, and given the previously abandoned NHS National Programme for IT, which sought to centralise patient data but was axed after costing £10bn, we’re not holding our breath.

Political dissonance


The project wouldn’t be happening at all if it weren’t for the Open Data Institute, which recommended that the government use its own data. This reveals a common problem in technology: governments are often left behind, stuck in slow regulatory processes while the technology transforms year on year.

It’s unlikely this project, even if it succeeds, will solve the problem unless radical changes are also made to the regulatory processes. However, Hancock does believe it could help make governments less top-down, which in theory could make them more responsive.

“This isn’t so ministers can micromanage from on high, but so the service managers, who deal directly with the users, can look for constant incremental improvements,” he said.

Expecting it to just work

Perhaps part of the reason that using data seems like something that should already have been happening for years is that, when open data is available, the various systems providing us with information simply “just work”.


“Just having open data makes life easier for everybody,” said Sir Tim Berners-Lee, creator of the World Wide Web and co-founder of the ODI. “When you came here you may have taken your phone out and wondered how to get here, and your phone may have offered you how to get here by car or bike or foot or public transport.

“I expected the same in another European city the other day and I selected public transport and a banner came up at the top and it said: ‘Warning, the data we have may not be complete’ – we may not be giving you the best route because we haven’t got all the necessary schedule data from all the necessary suppliers.”

“You expect the thing to work here in London, and it does, because of open data,” added Shadbolt.

Wikidata project to tackle language barriers in scientific research

A Wikidata project to make life sciences research data available across languages is set to boost research sharing in the international scientific community and ensure greater data consistency across Wikipedia pages in different languages.

WikiProject Molecular Biology is intended to be a centralised resource for data on everything from genetics to pharmaceuticals.

“Our aim is making Wikidata the central hub of life sciences data on genes, proteins, diseases and drugs and the relationships between them,” said Andra Waagmeester, semantic data expert and member of the WikiProject.

“Our approach is that we take a lot of resources on these topics and import them into Wikidata, and not only import them, but also maintain them through regular updates. This has two effects: research data that is closed in different data silos can now be used to populate Wikipedia articles, but also since Wikidata has a single API, scientists can start reproducing and sharing the information in Wikidata themselves.
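Wikidata’s single API returns entities as JSON carrying labels in every language they have been entered in. As a hedged sketch, the snippet below parses a hand-written dictionary mirroring the shape of a response from Wikidata’s `wbgetentities` endpoint; the entity id and values are abridged illustrations, not a live API result.

```python
# Hedged sketch: the dictionary below mirrors the JSON shape of a response
# from Wikidata's wbgetentities API (https://www.wikidata.org/w/api.php).
# The entity id "Q12345" and the label values are illustrative placeholders,
# not fetched from the live service.
sample_response = {
    "entities": {
        "Q12345": {
            "labels": {
                "en": {"language": "en", "value": "arsenic trioxide"},
                "zh": {"language": "zh", "value": "三氧化二砷"},
            }
        }
    }
}

def labels_for(entity_id, response):
    """Collect every language label the response holds for one entity."""
    entity = response["entities"][entity_id]
    return {lang: info["value"] for lang, info in entity["labels"].items()}

labels = labels_for("Q12345", sample_response)
print(labels["en"])  # arsenic trioxide
```

Because every language edition reads from the same entity, a scientist querying the API in one language sees the same underlying item as a reader in another.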

“So you can see it not only provides for the spreading of scientific knowledge and proper scrutiny of scientists’ findings, but also provides a strong infrastructure for scientists to share their own results.”


Language barriers can be a significant problem in scientific research, with discoveries sometimes failing to be disseminated and thus needing to be rediscovered in other countries.

An example given by Waagmeester is arsenic trioxide, a chemical compound shown in a 2013 paper to be an effective alternative to supplemental chemotherapy in treating a certain type of leukaemia.

“What’s intriguing about this story is that the paper was published in 2013 – the curative effect of the trioxide had already been known in China for decades,” he added.

“Only, it didn’t reach the English-speaking world due to two simple facts: the scientists who noticed it didn’t speak English, and the findings were published in a scientific journal that was obscure even to most Chinese readers.”

Had the research been more widely known, this alternative treatment could have been life-saving.

“It showed that having data closed can actually be bad for the treatment of other people. So now we argue that Wikipedia could solve that,” added Waagmeester.


There are also inconsistencies across the same Wikipedia pages in different languages about particular diseases and treatments, which the project hopes to solve. However, this problem is not unique to the life sciences – the population of Aruba, for example, differs depending on the language you view it in, an issue that the wider Wikidata project is set to tackle.

“The Wikimedia Foundation came up with another project that just had its first birthday last week, called Wikidata. Wikidata is a linked database that provides data that is editable by humans and machines, just as Wikipedia articles are. It’s essentially the Wikipedia model applied to data,” said Waagmeester.
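The Aruba inconsistency mentioned above comes down to each language edition hand-maintaining its own copy of a figure. A minimal sketch of the single-source idea follows; the item structure is heavily simplified and the population value is a placeholder, not the real figure.

```python
# Illustrative sketch: one machine-readable item feeding article text in
# several languages, so every edition derives the same figure. The item
# structure is simplified and the population value is a placeholder.
item = {
    "labels": {"en": "Aruba", "nl": "Aruba"},
    "claims": {"population": 100000},
}

def render_sentence(item, language):
    """Generate an article sentence from the shared item, per language."""
    templates = {
        "en": "{label} has a population of {population}.",
        "nl": "{label} heeft {population} inwoners.",
    }
    return templates[language].format(
        label=item["labels"][language],
        population=item["claims"]["population"],
    )

# Both language editions now quote the same number, because it lives
# in one place rather than in each article's wikitext.
print(render_sentence(item, "en"))
print(render_sentence(item, "nl"))
```

Updating the figure once in the shared item updates every edition that renders from it, which is exactly the consistency the project is after.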

“Now we can actually start writing Wikipedia articles where both the Chinese audience and the English-speaking audience can mine data that’s available in Wikidata, potentially breaking down the language barrier in science.”