
Why privacy considerations matter in the current open data environment

- April 29, 2014 in Featured

In a little under two months, experts in open data, privacy, and personal data management will gather in London to spend two days deliberating on the privacy concerns of opening up data systems that might contain elements of a personal nature. This meeting could not have come at a better time. This group of people has hitherto had few opportunities to interact, especially since open data communities have been preoccupied with non-personal data (specifically, data that is of a public nature). However, the data being collected and opened increasingly runs the risk of containing identifiers that make identification of individuals possible. One reason is the need for these data systems to respond to transparency and accountability imperatives, emphasizing the unavoidable tensions between openness and privacy. But most often, it is simply because anonymization fails. These issues are highlighted in previous posts to this forum.

Consequently, anonymization itself has become a contentious issue, with many privacy experts questioning whether it is in fact effective in protecting consumer privacy. A recent ODI Friday lunchtime lecture by Ross Anderson highlights several instances in which anonymity has failed in health data. In the world of geodata, too, the likelihood that privacy could be violated through open data systems was flagged in a recent blog post. The author demonstrates how, by applying a variable degree of data-mining effort, one is able to de-anonymise bicycle journey data from a publicly-available Transport for London dataset.


Other relevant issues arise. For example, uncontrolled data mining increases the risk of re-identifying anonymised data, often by linking several datasets. There are also risks that some cross-border transfers of personal data violate Principle 8 of the Data Protection Act, which states that “Personal data shall not be transferred to a country or territory outside the EEA unless that country or territory ensures an adequate level of protection for the rights and freedoms of data subjects in relation to the processing of personal data”. There is a need for more clarity on the problems surrounding the global flow of these problematic datasets.

Additionally, the question of how to define the boundaries of the key data terminologies – ‘personal’, ‘anonymised’, ‘transformed’, ‘aggregate’, ‘mydata’, and ‘pseudonymised’ – keeps coming up in the debate. At a workshop recently held by the My Data Working Group of Open Knowledge Finland, the need for a broader consideration of the concept of ‘my data’ was apparent from the discussions. From discussions led by keynotes from Nils Torvalds and Mydex’s William Heath, participants agreed that lessons learned from the UK’s development of Mydata could lend perspective to developing a similar strategy for Finland. The need for these principles to be applied in a global context is therefore reinforced by the international outlook of this London meeting.

However, it appears from recent debates at the Helsinki meeting, as well as on the My Data WG mailing list, that some practitioners are questioning not only the definition of the term ‘my data’ but also the necessity of using it to distinguish certain aspects of personal data. The London expert meeting offers an opportunity to debate this further, as one of the goals of the working group is to produce a working document (much like the Open Definition) which clearly defines these terms and streamlines how we communicate about them.

Crucial to debates on mydata systems is the issue of what controls data subjects can have, and want to have, over the data held on them. Managing one’s own data can be time-consuming and technical, and is often not straightforward. The current opt-out options (for example, as used in the controversial scheme) need to be weighed against schemes that offer opt-ins. Often, the options for giving consent are broken for one reason or another, and there is a need to investigate what alternative forms of control are available to data subjects.

In response to this myriad of issues, the participants at the meeting are tasked with proposing interventions that tackle tools, policy and data literacy gaps through capacity-building, tools development and communications activities. They will therefore carefully review the efficacy and applicability of existing tools: for example, consent receipts, datenbriefs, and proposals for co-regulation (by data subjects and data publishers) in managing privacy concerns. Additionally, they will lay out principles to govern the behaviour of data publishers. Among other things, the principles will propose standards to be maintained when anonymising and aggregating data, to minimise the risk of re-identification. They will also include a checklist to guide data publishers’ decisions on whether or not to open up a particular dataset. Overall, the goal is to come out of this meeting with an outline of specific interventions that takes into consideration the different interests and capacities of those already undertaking activities in this environment.

In the weeks leading up to this meeting, the WG will continue to engineer critical discussions on the dedicated mailing list, wiki page and on the Twitter forum, so do visit these pages to contribute your thoughts.

What do they know about me? Open data on how organisations use personal data

- March 18, 2014 in Featured

This post is by Reuben Binns, a postgraduate researcher at the University of Southampton, Web Science Institute. His research interests include ethical and legal aspects of personal data and open data. Find him on Twitter and GitHub.

When open data and personal data collide, attention is quite rightly drawn to the negative implications for privacy; namely, the possibility that open data contains – or can be used to infer – personal data. But there’s also a flip-side; open data could help protect privacy by revealing the activity of those who collect and share our personal data. This is something I’ve been exploring in my research using the UK Register of Data Controllers.

This dataset, covering the data protection notifications of 350,000 UK organisations, is released by the Information Commissioner’s Office under an Open Government License (it’s available by DVD on request from the ICO, and can be searched using their website portal). It discloses why organisations collect personal data, what kinds of data they collect, from whom and who has access to it. My research uses snapshots of this data over a 3 year period to paint a picture of the UK personal data landscape – who knows what about whom, and why. Of course, some of this data may be inaccurate or incomplete, but it’s compiled from what organisations themselves are legally obliged to disclose to the ICO. The raw XML was parsed and loaded into a database which can be queried. The full results will be released in a forthcoming paper, but alongside this, I’ve also been experimenting to see how the data could provide context to some of the privacy stories that have been in the media spotlight in recent years.
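The parse-and-load step can be sketched as follows. This is a minimal illustration only: the element names, attributes, and table schema below are hypothetical stand-ins, not the ICO register’s actual XML format.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Toy stand-in for a register snapshot (the real ICO XML differs).
SAMPLE_XML = """
<register>
  <entry number="Z0000001">
    <organisation>Example Ltd</organisation>
    <purpose name="Staff Administration">
      <data_class>Trade Union Membership</data_class>
      <recipient>Traders in personal data</recipient>
    </purpose>
  </entry>
</register>
"""

def load_register(xml_text, conn):
    """Flatten nested register entries into one queryable table."""
    conn.execute("""CREATE TABLE IF NOT EXISTS disclosures
                    (reg_no TEXT, org TEXT, purpose TEXT,
                     data_class TEXT, recipient TEXT)""")
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        org = entry.findtext("organisation")
        for purpose in entry.iter("purpose"):
            for dc in purpose.iter("data_class"):
                for rec in purpose.iter("recipient"):
                    conn.execute("INSERT INTO disclosures VALUES (?,?,?,?,?)",
                                 (entry.get("number"), org,
                                  purpose.get("name"), dc.text, rec.text))

conn = sqlite3.connect(":memory:")
load_register(SAMPLE_XML, conn)
rows = conn.execute("SELECT org, recipient FROM disclosures").fetchall()
print(rows)  # [('Example Ltd', 'Traders in personal data')]
```

Flattening each (organisation, purpose, data class, recipient) combination into its own row is what makes the register queries described below possible.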

One example is the ongoing ‘construction worker blacklist’ fiasco. The Consulting Association, a rather blandly named outfit, were fined by the ICO for compiling a blacklist of over 3000 construction workers. Employers paid for access to the list in order to screen out potential workers who had previously caused ‘trouble’ – by, for instance, raising safety concerns on site or engaging in trade union activity. Some of the blacklisted workers were unable to find work for years and are now seeking compensation.

What’s ironic – and alarming – about this case and others like it is that the potentially harmful activity often isn’t itself prohibited by law. In the end, the £5,000 fine was issued due to the Consulting Association’s failure to register their activity with the ICO. The truth is, even legal activity that regulators are aware of may still endanger privacy. So I dug into the register to find companies openly claiming to engage in similar practices.

I found 422 organisations who claim to be collecting information about the trade union membership status of employees of other organisations, for the purposes of selling it to third parties. This was essentially the business model of the now defunct Consulting Association. I’ve visualised a sample of 42 of these organisations below – the yellow nodes are the categories of third parties with whom they share this data.

See full image here.

A more recent controversy concerns the use of patient health data. In the debate over the proposed scheme – under which medical records currently held by GPs would be aggregated into a central database and made available to researchers and companies outside the NHS – it emerged that identifiable patient data from hospitals has apparently already been sold (indirectly) to insurance companies, to the shock and dismay of privacy campaigners and health professionals alike. The body responsible, the HSCIC, has an entry in the register stating who they share personal data with – a copy of which can be seen by searching their registration number (Z8959110) in the ICO’s public portal. (NB: no mention of insurance companies.)

A query for organisations who are collecting health data for ‘health administration and services’ purposes returns over 57,000 results. We can refine this to show only those organisations who give this data to ‘traders in personal data’, which yields 840 matches. Many of these appear to be opticians – branches of ‘Specsavers’ make up about a third – so if you’ve had an eye test lately, the results have possibly been aggregated up and sold through third parties. But there also appear to be some other health providers in there with potentially more sensitive data; one of them is an NHS Trust specialising in mental health. There may be a perfectly legitimate and ethical reason why they’re giving away patient data to private data brokers – but I’m struggling to guess what that could be.
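The refinement described above amounts to filtering first by purpose and then by recipient. A sketch against a hypothetical flattened register table (the purpose and recipient strings here are illustrative, not the register’s exact vocabulary):

```python
import sqlite3

# Hypothetical flattened register rows: (organisation, purpose, recipient).
rows = [
    ("Optician A", "Health Administration and Services", "Traders in personal data"),
    ("Hospital B", "Health Administration and Services", "Healthcare practitioners"),
    ("Retailer C", "Accounts and Records",               "Traders in personal data"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE disclosures (org TEXT, purpose TEXT, recipient TEXT)")
conn.executemany("INSERT INTO disclosures VALUES (?,?,?)", rows)

# First pass: everyone processing data for health administration.
health = conn.execute(
    "SELECT org FROM disclosures WHERE purpose = ?",
    ("Health Administration and Services",)).fetchall()

# Refinement: keep only those who also disclose to data traders.
selling = conn.execute(
    """SELECT org FROM disclosures
       WHERE purpose = ? AND recipient = ?""",
    ("Health Administration and Services", "Traders in personal data")).fetchall()

print(len(health), [o for (o,) in selling])  # 2 ['Optician A']
```

The same two-step filter, run at scale, is what narrows the 57,000 health-administration entries down to the 840 that name data traders as recipients.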

Real privacy harms could result from these kinds of data sharing arrangements, even when they don’t contravene data protection law. If I were a member of a trade union, and my employers had any relationship with those 422 companies, I’d want to know about it. If I were a user of an NHS mental health service, I’d want to know if they’re sharing my medical data with data brokers and why. Whether it’s employment history, political affiliations, or health records, authoritative and accurate open data on who knows what about who is a pre-requisite for preventing privacy harms before they arise.

Publishing this information in obscure, unreadable and hidden privacy policies and impact assessments is not enough to achieve meaningful transparency. There’s simply too much of it out there to capture in a piecemeal fashion, in hidden web pages and PDFs. To identify the good and bad things companies do with our personal information, we need more data, in a more detailed, accurate, machine-readable and open format. In the long run, we need to apply the tools of ‘big data’ to drive new services for better privacy management in the public and private sector, as well as for individuals themselves.

So while there are genuine tensions between openness and privacy, there are also harmonies. When it comes to the organisations, businesses and institutions that shape our lives and livelihoods, transparency about how they use our personal data is essential. It’s the first step towards a new privacy infrastructure fit for the digital age – and open data has a crucial part to play.

Further links:
See the github project report for more on the data source itself – contributions / forks are very welcome. See my previous thoughts on how openness can help rather than hinder privacy here and here, and my musings on the scheme shortly before it was postponed.

Open Data Privacy

- December 13, 2013 in Featured

“yes, the government should open other people’s data”

Traditionally, the Open Knowledge Foundation has worked to open non-personal data – things like publicly-funded research papers, government spending data, and so on. Where individual data was a part of some shared dataset, such as a census, great amounts of thought and effort had gone in to ensuring that individual privacy was protected and that the aggregate data released was a shared, communal asset.

But times change. Increasing amounts of data are collected by governments and corporations, vast quantities of it about individuals (whether or not they realise that it is happening). The risks to privacy through data collection and sharing are probably greater than they have ever been. Data analytics – whether of “big” or “small” data – has the potential to provide unprecedented insight; however some of that insight may be at the cost of personal privacy, as separate datasets are connected/correlated.


Both open data and big data are hot topics right now, and at such times it is tempting for organisations to get involved without necessarily thinking through all the issues. The intersection of big data and open data is somewhat worrying, as the temptation to combine the economic benefits of open data with the current growth potential of big data may lead to privacy concerns being disregarded. Privacy International are right to draw attention to this in their recent article on data for development, but of course other domains are affected too.

Today, we’d like to suggest some terms to help the growing discussion about open data and privacy.

Our Data is data with no personal element, and a clear sense of shared ownership. Some examples would be where the buses run in my city, what the government decides to spend my tax money on, how the national census is structured and the aggregate data resulting from it. At the Open Knowledge Foundation, our default position is that our data should be open data – it is a shared asset we can and should all benefit from.

My Data is information about me personally, where I am identified in some way, regardless of who collects it. It should not be made open or public by others without my direct permission – but it should be “open” to me (I should have access to data about me in a useable form, and the right to share it myself, however I wish, if I choose to do so).

Transformed Data is information about individuals, where some effort has been made to anonymise or aggregate the data to remove individually identified elements.


We propose that there should be some clear steps which need to be followed to confirm whether transformed data can be published openly as our data. A set of privacy principles for open data, setting out considerations that need to be made, would be a good start. These might include things like consulting key stakeholders including representatives of whatever group(s) the data is about and data privacy experts around how the data is transformed. For some datasets, it may not prove possible to transform them sufficiently such that a reasonable level of privacy can be maintained for citizens; these datasets simply should not be opened up. For others, it may be that further work on transformation is needed to achieve an acceptable standard of privacy before the data is fit to be released openly. Ensuring the risks are considered and managed before data release is essential. If the transformations provide sufficient privacy for the individuals concerned, and the principles have been adhered to, the data can be released as open data.

We note that some of “our data” will have personal elements. For instance, members of parliament have made a positive choice to enter the public sphere, and some information about them is therefore necessarily available to citizens. Data of this type should still be considered against the principles of open data privacy we propose before publication, although the standards compared against may be different given the public interest.

This is part of a series of posts exploring the areas of open data and privacy, which we feel is a very important issue. If you are interested in these matters, or would like to help develop privacy principles for open data, join the working group mailing list. We’d welcome suggestions and thoughts on the mailing list or in the comments below, or talk to us and the Open Rights Group, who we are working with, at the Open Knowledge Conference and other events this autumn.

My Data & Open Data

- December 13, 2013 in Featured

The Open Knowledge Foundation believes in open **knowledge**: not just that some data is open and freely usable, but that it is **useful** – accessible, understandable, meaningful, and able to help someone solve a real problem.

A lot of the data which could help me improve my life is data about me – “MyData” if you like. Many of the most interesting questions and problems we have involve personal data of some kind. This data might be gathered directly by me (using my own equipment or commercial services), or it could be harvested by corporations from what I do online, or assembled by public sector services I use, or voluntarily contributed to scientific and other research studies.

Tape library, CERN, Geneva 2

Image: “Tape library, CERN, Geneva 2″ by Cory Doctorow, CC-BY-SA.

This data isn’t just interesting in the context of our daily lives: it bears on many global challenges in the 21st century, such as supporting an aging population, food consumption and energy use.

Today, we rarely have access to these types of data, let alone the ability to reuse and share it, even when it’s **my data**, about just me. Who owns data about me, who controls it, who has access to it? Can I see data about me, can I get a copy of it in a form I could reuse or share, can I get value out of it? Would I even be allowed to publish openly some of the data about me, if I wanted to?

**But how does this relate to open data?** After all, a key tenet of our work at the Open Knowledge Foundation is that personal data should **not** be made open (for obvious privacy reasons)!

However there are, in fact, obvious points where “Open Data” and “My Data” connect:

* MyData becomes Open Data (via transformation): Important datasets that are (or could be) open come from “my data” via aggregation, anonymisation and so on. Much statistical information ultimately comes from surveys of individuals, but the end results are heavily aggregated (for example, census data). This means “my data” is an important source but also that it is essential that the open data community have a good appreciation of the pitfalls and dangers here – e.g. when anonymisation or aggregation may fail to provide appropriate privacy.

* MyData becomes Open Data (by individual choice): There may be people who want to share their individual, personal, data openly to benefit others. A cancer patient could be happy to share their medical information if that could assist with research into treatments and help others like them. Alternatively, perhaps I’m happy to open my household energy data and share it with my local community to enable us collectively to make sustainable energy choices. (Today, I can probably only see this data on the energy company’s website, remote, unhelpful, out of my control. I may not even be able to find out what I’m permitted to do with my data!)

* The Right to Choose: if it’s **my data**, just about me, I should be able to choose to access it, reuse it, share it and open it if I wish. There is an obvious translation here of key Open Data principles to MyData. Where the Open Definition states that material should be freely available for use, reuse and redistribution by anyone, we could say that my data should be freely available for use, reuse and redistribution by **me**.

We think it is important to explore and develop these connections and issues. The Open Knowledge Foundation is therefore today **launching an Open Data & MyData Working Group**. Sign up to the working group mailing list to participate.

This will be a place to discuss and explore how open data and personal data intersect. How can principles around openness inform approaches to personal data? What issues of privacy and anonymisation do we need to consider for datasets which may become openly published? Do we need “MyData Principles” that include the right of the individual to use, reuse and redistribute data about themselves if they so wish?

## Appendix

There are plenty of challenging issues and questions around this topic. Here are a few:

### Anonymization

Are big datasets actually anonymous? Anonymisation is incredibly hard. This isn’t a new problem (Ars Technica had a [great overview][ars] in 2009), though it gets more challenging as more data becomes available, openly or otherwise: more data that can be cross-correlated means anonymisation is more easily breached.
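A minimal sketch of why cross-correlation breaks anonymisation: if an “anonymised” release keeps quasi-identifiers (say, a postcode prefix and a birth year) that also appear alongside names in some public dataset, a simple join re-identifies people. The data and field names here are invented for illustration.

```python
# "Anonymised" records: names removed, quasi-identifiers retained.
anonymised = [
    {"postcode": "SO17", "birth_year": 1985, "diagnosis": "asthma"},
    {"postcode": "EC1A", "birth_year": 1990, "diagnosis": "diabetes"},
]

# A separate public dataset that happens to share those quasi-identifiers.
public = [
    {"name": "Alice", "postcode": "SO17", "birth_year": 1985},
    {"name": "Bob",   "postcode": "EC1A", "birth_year": 1990},
]

def reidentify(anon_rows, public_rows, keys=("postcode", "birth_year")):
    """Link the two datasets on shared quasi-identifiers."""
    matches = []
    for a in anon_rows:
        candidates = [p for p in public_rows
                      if all(p[k] == a[k] for k in keys)]
        if len(candidates) == 1:  # a unique match defeats the anonymisation
            matches.append((candidates[0]["name"], a["diagnosis"]))
    return matches

print(reidentify(anonymised, public))
# [('Alice', 'asthma'), ('Bob', 'diabetes')]
```

The more auxiliary datasets exist, the more combinations of attributes become unique, and the more often this trivial join succeeds.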

### Releasing Value

There’s a lot of value in personal data – [Boston Consulting Group claim €1tn][ftvalue]. But even BCG point out that this value can only be realised if the processes around personal data are more transparent. Perhaps we can aspire to more than transparency, and have some degree of personal control, too.

### Governments

Governments are starting to offer some proposals here such as “MiData” in the UK. This is a good start but [do they really serve the citizen][TH1]?

There’s also some [proposed legislation][midatalaunch] to drive companies to give consumers the right to see their data.

But is access enough?

The consumer doesn’t own their data (even when they have “MiData”-style access to it), so can they publish it under an open licence if they wish?

### Whose data is it anyway?

Computers, phones, energy monitors in my home, and so on, aren’t all personal to me. They are used by friends and family. It’s hard to know whose data is involved in many cases. I might want privacy from others in my household, not just from anonymous corporations.

This gets even more complicated when we consider the public sphere – surveillance cameras and internet of things sensors are gathering data in public places, about groups of independent people. Can the people whose images or information are being captured access or control or share this data, and how can they collaborate on this? How can consent be secured in these situations? Do we have to accept that some information simply cannot be private in a networked world?

(Some of these issues were raised at the Open Internet of Things Assembly in 2012, which led to a [draft declaration][iot]. The declaration doesn’t convey the breadth of complex issues around data creation and processing that were hotly debated at the assembly.)

### MyData Principles

We will need **clear principles**. Perhaps, just as the Open Definition has helped clarify and shape the open data space, we need analogous “MyData” Principles which set out how personal data should be handled. These could include, for example:

* That my data should be made available to me in machine-readable bulk form
* That I should have the right to use that data as I wish (including using, reusing and redistributing it if I so wish).
* That none of my data (where it contains personal information) should be made open without my full consent.