Anonymisation

The discussion started with an introduction to a background paper by the European Data Protection Roundtable. This is a great document that describes the state of European law and also of computer science.

References mentioned: the Netflix paper and the paper addressing social network data, both by Narayanan and Shmatikov.

The talk skimmed over some basics of how European legislation is created, but thankfully EDRI has an excellent guide to the Brussels Maze.

Caspar Bowden gave a historical overview of data protection legislation in Europe since 1995, focusing on how a botched definition of what is personal data has created a colossal loophole. Caspar is writing a paper on this and would appreciate additional insights.

The European Data Protection Directive (DPD) was enacted within that period, but the process is not well known. When the draft DPD went through the EU parliament in February 1995, there were two separate articles, one defining personal data and one on depersonalised data. But when it came out after secret discussions, and we got the current directive, there was only one article defining personal data, plus Recital 26 defining depersonalised data. Member states don’t have to translate recitals, and indeed some didn’t incorporate the same definition of depersonalised data. Caspar believes the UK government of the time introduced this change in order to weaken privacy protections.

What’s the effect?

Recital 26 states that data shall be considered anonymous if the person to whom the original data referred cannot be identified by the controller, or by any other person. This last part was missed in the UK version. This has very practical implications. For example, Internet Service Providers (ISPs) allocate IP addresses to customers. These uniquely identify the computer on the internet, and are then recorded as people carry out activities online. The ISP can link an IP to a customer, but can others be assumed to know the identity of the real person behind the computer? IPs are seen as “pseudonymous” by some people, not fully personal information. This is a bit like saying your car plate number is not personal data because you need to ask the authorities to do the matching. So tracking your car’s movements is ok until I find out your name.

By 2007-8, IP addresses were considered personal data in many contexts and jurisdictions. This didn’t matter for most countries where Recital 26 was implemented as such in local law, but in the UK it wasn’t, so this kind of data is treated as a different category.

This lack of a proper definition of anonymised data has made guidance from the UK Information Commissioner’s Office (ICO) very mixed up. In 2012, the ICO published some anonymisation guidelines. These stated that pseudonymous data would be considered as anonymous and not personal data. This has a big impact on privacy. The Article 29 Working Party (which groups data protection authorities across Europe) issued an opinion on anonymisation, meant to be guidance for all EU countries. The paper runs to 30 pages, and it is very comprehensive. It says clearly that it is a cardinal mistake to regard pseudonymised data as anonymous data.

The paper goes on to describe 7 methodologies which can be used to anonymise data, but there is no general formula, just rather bleak guidance that can be best summarised as “you need lots of help and even then it’s hard to say”. Or in Caspar’s words: if you want to anonymise data you need to hire a Computer Science PhD who is clueful and do strong analysis of the data and its potential use! This isn’t the usual data protection guidance.

There is also confusion about the meaning of pseudonymous data itself. There are 2 types that are supremely different, and any definition that blurs them is problematic:

(1) record-level data: take away the name and identifier numbers, and replace them with a simple index number (serial or random), so the data is linked but the index doesn’t have a meaning.

(2) other types of pseudonymisation

The question is: does the data controller keep a copy of the index, or is it deleted after it is pseudonymised? In most cases, you need the index for some use.

Article 29 WP says: if you want to protect privacy you should delete the index. Caspar recommends reading the docs: they cover tough computer science concepts, but are very good.
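A minimal sketch of the record-level case may help make the index question concrete. The field names (`name`, `nhs_number`, `pid`) and the function `pseudonymise` are hypothetical and chosen only for illustration; this is not taken from the Article 29 opinion.

```python
import secrets


def pseudonymise(records, id_fields=("name", "nhs_number")):
    """Replace direct identifiers with a random, meaningless index number.

    Returns (released, index). `released` is what would be handed on;
    `index` maps each pseudonym back to the removed identifiers and is
    the thing the controller must decide to keep or delete.
    """
    index = {}
    released = []
    for record in records:
        pseudonym = secrets.token_hex(8)  # random, carries no meaning
        while pseudonym in index:  # extremely unlikely collision, retry
            pseudonym = secrets.token_hex(8)
        index[pseudonym] = {f: record[f] for f in id_fields if f in record}
        body = {k: v for k, v in record.items() if k not in id_fields}
        released.append({"pid": pseudonym, **body})
    return released, index


# Hypothetical example records.
records = [
    {"name": "Alice Example", "nhs_number": "0000000001",
     "postcode": "AB1", "diagnosis": "asthma"},
    {"name": "Bob Example", "nhs_number": "0000000002",
     "postcode": "CD2", "diagnosis": "diabetes"},
]
released, index = pseudonymise(records)

# While the index exists, the controller can re-identify any row.
first_person = index[released[0]["pid"]]["name"]

# Deleting the index removes that direct lookup, but the remaining
# fields (postcode, diagnosis) may still identify someone indirectly.
index.clear()
```

Note that deleting the index only removes the direct route back to the person; whatever is left in the released rows still needs the kind of analysis the opinion describes.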

Where does this all lead?

There has been a new general Data Protection Regulation in the works for the last 2 years. It is hung up on various issues. There is worry that the Commission has just sold out on anonymisation; it published the regulation without any pseudonymous data concept, and this time put the definition of anonymisation in an article, where it belongs. So it will be interpreted correctly in each country.

But then the civil liberties committee had lots of proposed amendments, and chose to introduce the concept of pseudonymous data without a good definition. The Council of Ministers changed this to the process of pseudonymisation, rather than a definition. And sadly the European Commission went along with that rather than pushing for it.

Caspar will be happy to explain any of the above in more detail.

Questions / topic ideas / comments:

  • impact for UK if regulations enacted?
  • practical understanding of current situation and implications for those who are releasing pseudonymous or anonymised data, or who are attempting to anonymise data
  • what about permissions to access the index – e.g. you can get data with a court order if you are police but not accessible for most people.
  • what about practical measures to make it almost impossible for an ordinary citizen to get access to the data?
  • regarding definition of pseudonymisation as a category, a strange midpoint between anon and identifiable. The process of pseudonymisation depends heavily on measures applied
  • would like to go to anonymity concept in a relative way, identifiable for you but not for the recipient. get away from these absolutes. there’s no room for technical or contractual measures for access to the index here
  • international transfer of personal data… many implications beyond consent
  • to what extent does the Scottish ICO-equivalent fit with UK vs EU?
  • is pseudo a meaningful and useful category, even if we understand anonymisation isn’t binary? better to stick with the scale of “fiction” we have today with anonymisation scale, than to introduce this extra category in the middle?? Perhaps the role of pseudonymous is on data CONTEXT rather than CONTENT? useful to draw attention to context maybe?

Remarks back:

The central concepts here are absolute vs relative identifiability. Until around 2007 there was a general opinion on the concept of personal data around relative identifiability. Meaning when you have a Data Controller (DC), and this DC in theory is the only person able to re-identify some data via an index: ‘then it’s not personal data as long as you take as much security as you can to ensure the index doesn’t escape’. But nowadays the debate has moved a lot further to the concept of absolute identifiability, where the controller cannot identify. [this is the WP216 opinion]

The new regulation defines this as ‘reasonably likely to be…’ If everything, every cookie, is identifiable, how do you cope with this in an internet context? In the past this was ignored. Now, it’s ‘reasonably likely’ or it’s personal data. BUT special exemptions may apply. For example, in the new Regulation the wording, as of last October, established that for research purposes you should try to anonymise, but if that wasn’t possible you could use pseudonymised data, or if there was important public interest you might get special exemption to use personal data.

Research data

The research community needs to be able to access the index to do the research. The new WP216 is adamant about index deletion but this defeats the purpose of the research protocol for, say, cancer registries or long term medical studies (longitudinal).

When you talk about reasonably likely identifiable, you mean 2 things: (1) reasonable via data analysis, to get a probabilistic chance of identifying, or (2) where the index is removed and only accessible via, say, a court order. The new regulations are problematic because:

  • they say pseudo is still personal BUT have taken away all your rights as a data subject
  • if there’s a breach of pseudo, you notify the data controller, not the data subject.
  • also nullified right to access pseudo data, which is also the right of the data subject to correct the data. without the index, you can’t CONTACT the data subject to give the subject their rights!

So it’s all really confused. either you know who the subject is but they have no rights, or the subject isn’t identifiable and so can’t have their rights.

So the 3 pillars of the EC are drifting towards this idea that pseudo data is personal, but have lost the rights associated with it. It’s a way of covering legal embarrassment.

Are there times it makes sense to use pseudonymisation?

Pseudo data only makes sense if there’s prior knowledge of one person

It’s a reasonable internal security provision, e.g. internal accounts in a system, for audit purposes, but it doesn’t make sense as a release to third party control or risk limiter.

…but actual release is a whole other thing.

Policymakers have been conned into thinking pseudo techniques are a ‘new tech’ which solves the problems and makes it safe to release data. It’s been brewing in the UK for 10 years or so (Wellcome Trust etc.). Now it is the agenda of the current UK govt: ‘if you pseudonymise it, no problem’.

ICO knew about this stuff ages ago… [Caspar admits he’s deeply involved so isn’t objective!]

Mark Elliot is running a mailing list and a new project spun off from the ICO code, a body of knowledge about pseudonymous data. Kieran proposed a larger scientific research centre of expertise. But the UK government hasn’t chosen to do this substantively.

ICO is in tough spot now as a result.

What are the practical risks of potential future problems?

  • data controller database compromised, index gets into public domain
  • future tech is better than current and allows reidentification

Is pseudo still better than releasing personal data?

Using the ‘balance’ term suggests there is a spectrum with 2 ends and you pick a point. But in fact it’s complex and there’s no magic sweet spot. It is a good excuse, the path of least resistance, to argue there’s a nice balance point.

Other risks:

  • social steganography
  • eg data collected… countries with civil war, factions may use a dataset to identify a social structure around a data point
  • shown time and again people can be identified with some success

There are two notions of risk – social risk and security/tech notion of risk (if vulnerability risk is 100%)

At what stage has one ‘done enough’? may revise judgement in future of course

  • not just that you can do an attack but it’s plausible.
  • is a question: does a motivated intruder exist
  • of course that will depend on the ‘target’ individual – some more at risk than others
  • level of diligence of a private sector DC to a public sector DC may differ?
  • motivations different
  • penalisations of private sector reckless action – none known so far

Approaching the uncertainty

  • what is data? what is personal? bad situation! very few people understand this. need to go deeper in discussion human rights issue – treaty of Rome, what UN is doing. Wider discussion of humanity
  • internet architecture is destiny
  • tech & policy are 2 separate worlds — need more overlap need global discussion

3 ah-ha moments:

  1. legal and tech world intersection
  2. how the discrepancies arise between UK, EC and other positions, and how it all works out: legislation is in flux and differs in practice. hadn’t realised 4 words made so much difference and that this means different countries have different setups
  3. pseudonymisation seen as comfort blanket by policy makers removing need to think about complex landscape of anonymisation

WP216 document – very radical! mostly because DP officials often not clueful technically (30 out of 1500 with any CS skills??)

Lobbying from US companies trying to kill it. EU Commission trying to set up carrot and stick. Simplifying stuff is carrot, but stick is all-embracing personal data, with exemptions to make it acceptable.

What are other implications?

  • for data subject as above
  • for DC in terms of index removal as above
  • “the DC should not have to collect any more data to give effect to regulation” – what does that mean? sounds harmless but is fatal! eg your Google clickstream when not logged in. Indexed by cookie and IP address. you go ask Google and you provide your cookie. If Google get that request, they can’t guarantee that cookie is yours, and so they cannot give out your data because there’s extra info that is necessary…. unless you can strongly prove the data is yours. Nullifies right to access or correct or delete your data ! they should have said “you should offer the data subject a strong data authentication secret’ (just for this purpose)

If you could access your pseudo data, that might be useful for you, but it depends on the DC being willing to give it to you. Article 10 removes that.

implications for private data sharing and third parties? none in particular

Hashes as indexes: a hash is a number which is a mashup of some data.
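As a hedged illustration of why a hash used as an index is not automatically safe: a plain hash is deterministic, so anyone who can guess the input can recompute the index. The sketch below uses SHA-256 and a made-up address from the documentation range `192.0.2.0/24`; the helper names `hash_index` and `reidentify` are hypothetical.

```python
import hashlib


def hash_index(identifier: str) -> str:
    """Derive an index from an identifier with a plain, unsalted hash."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def reidentify(target_hash, candidates):
    """Recover the identifier by hashing every plausible candidate."""
    for candidate in candidates:
        if hash_index(candidate) == target_hash:
            return candidate
    return None


# The released data carries only the hash of the IP address...
pseudonym = hash_index("192.0.2.17")

# ...but the space of plausible inputs is small enough to enumerate.
candidates = (f"192.0.2.{i}" for i in range(256))
recovered = reidentify(pseudonym, candidates)
```

The same reasoning applies to any identifier drawn from a small or guessable set (phone numbers, postcodes, number plates).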

If you care about this, please beat up the Ministry of Justice and also lobby your MEPs.

Cabinet Office should have set up a scientific centre of excellence on this — they didn’t so we have a community group, better than nothing. but they are data consumers not privacy people. Funded by ICO. stepped in with money when no other money was there, and they didn’t have to but at least they tried and no one else did

Crash course in differential privacy:

It’s hard! Invented in 2006 by Cynthia Dwork and co-authors; she is a genius. It is geared towards a national research centre, physically restricting data location, with heavy duty data research around this.

Anonymisation, the classic technique: perturb the data whilst keeping its statistical properties the same. You usually do that by looking at the whole data set and mixing it up from there.

Application of Differential Privacy (Diff Priv): imagine formulating your stats query and asking it of this database. The system looks at your query and automatically works out the distribution of noise that gives the best privacy for a given privacy bar whilst still giving you a statistically useful result. It allows for a set EPSILON level of privacy protection, so when you fire in these queries you get a fixed privacy budget. Once you’ve asked some set of queries you have used up all your privacy, and then you have to throw it away. This is like alien weirdness to policy people! The paradoxical consequence is too hard to explain.

Example in practice: if you make some info available it PREVENTS some other info from being released later, because you only have so much privacy to use up! If you pseudonymise the data it’s hard to predict how identifiable stuff is, but with Diff Privacy, as long as you are within your epsilon, you know it’s OK, regardless of what happens later.
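The budget idea can be sketched in a few lines. This is an illustrative toy, not the system described in the session: it assumes simple counting queries, uses the Laplace mechanism (a standard way of adding noise for such queries), and adds up the epsilon spent per query. The class `PrivateCounter`, the exception `BudgetExhausted` and the row fields are hypothetical names.

```python
import random


class BudgetExhausted(Exception):
    """Raised when a query would exceed the remaining privacy budget."""


class PrivateCounter:
    """Answers counting queries with Laplace noise under a fixed budget."""

    def __init__(self, rows, total_epsilon, rng=None):
        self._rows = list(rows)
        self.remaining = total_epsilon
        self._rng = rng or random.Random()

    def _laplace(self, scale):
        # The difference of two exponentials with mean `scale`
        # is Laplace-distributed with that scale.
        return (self._rng.expovariate(1 / scale)
                - self._rng.expovariate(1 / scale))

    def count(self, predicate, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise BudgetExhausted("no privacy budget left for this query")
        self.remaining -= epsilon
        true_count = sum(1 for row in self._rows if predicate(row))
        # One person changes a count by at most 1, so the noise
        # scale is 1 / epsilon.
        return true_count + self._laplace(1 / epsilon)


# Hypothetical rows and queries.
rows = [{"broken_leg": i % 3 == 0, "diabetes": i % 5 == 0} for i in range(100)]
db = PrivateCounter(rows, total_epsilon=1.0, rng=random.Random(0))

legs = db.count(lambda r: r["broken_leg"], epsilon=0.5)
sugar = db.count(lambda r: r["diabetes"], epsilon=0.5)

try:
    db.count(lambda r: r["broken_leg"] and r["diabetes"], epsilon=0.5)
    third_query_refused = False
except BudgetExhausted:
    third_query_refused = True
```

Answering the first two queries is exactly what prevents the third from being answered, which is the paradox described above.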

Note that interactive scenario is less cheap 🙂

Individual guaranteed diff privacy: if you are a part of the dataset or not, the info revealed about you is the same! the answer is statistically very close.
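For reference, the usual formal statement of this guarantee can be written as follows. The symbols are not from the session and are introduced here as assumptions: $M$ is the randomised query mechanism, $D$ and $D'$ are any two datasets differing only in one individual's record, $S$ is any set of possible outputs, and $\varepsilon$ is the privacy parameter.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
```

A small $\varepsilon$ means the two probabilities are nearly equal, which is the sense in which the answer is "statistically very close" whether or not you are in the dataset.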

This is NOT an anonymisation technique.

This limits inference attacks (that’s where you ask about broken legs then about appendicitis then about diabetes and the set of data you get overall lets you infer stuff)
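A toy example of one such attack (a differencing attack) against exact, noise-free counts is sketched below. The rows, field names and the helper `exact_count` are made up for illustration, and the attacker is assumed to know that the target is the only 71-year-old in postcode AB1.

```python
def exact_count(rows, predicate):
    """An exact counting query with no noise added."""
    return sum(1 for row in rows if predicate(row))


# Hypothetical dataset with names already removed.
rows = [
    {"postcode": "AB1", "age": 34, "diabetes": False},
    {"postcode": "AB1", "age": 71, "diabetes": True},
    {"postcode": "AB1", "age": 52, "diabetes": True},
    {"postcode": "CD2", "age": 45, "diabetes": False},
]

# Two innocuous-looking aggregate queries...
everyone = exact_count(
    rows, lambda r: r["postcode"] == "AB1" and r["diabetes"])
all_but_target = exact_count(
    rows,
    lambda r: r["postcode"] == "AB1" and r["diabetes"] and r["age"] != 71)

# ...whose difference reveals one person's diagnosis.
target_has_diabetes = (everyone - all_but_target) == 1
```

With noise added to each answer and a cap on the total epsilon spent, the difference between the two answers no longer pins down the individual.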

How can you institutionalise this? the epsilon knob needs some risk calibration! Very tough. Helps if you keep some hold out data which you can use later for model validation. Questions being asked in Computer Science now: can you do distributed Diff Priv? can you do diff priv on data streams?