Anonymisation through aggregations

Discussion points:

Open data about, say, the weather is fine to open up – no personal element. But what about when we have a set of data with information about people: can we open it up, and if so how?

How does aggregation work? There must be some degree of ‘collapsing’.

If you have a 2×2 table, can you turn that into a record-based system? Yes: there’s no real distinction between the two. Table vs record doesn’t matter.
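A minimal sketch of that equivalence, assuming a made-up 2×2 table of counts (the categories and numbers are hypothetical, not from the session): each cell count expands into that many identical records, and counting the records rebuilds the table.

```python
from collections import Counter

# Hypothetical 2x2 table: counts of people by (sex, smoker).
table = {
    ("F", "yes"): 3,
    ("F", "no"): 5,
    ("M", "yes"): 1,
    ("M", "no"): 4,
}

def table_to_records(table):
    """Expand each cell count into that many identical records."""
    return [cell for cell, count in table.items() for _ in range(count)]

def records_to_table(records):
    """Count identical records to rebuild the table."""
    return dict(Counter(records))

records = table_to_records(table)
rebuilt = records_to_table(records)
```

Note the cell with a count of 1: as a record or as a table cell, it is the same single person.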

From the theory – there’s k-anonymity: you can collect data until there are enough individuals (at least k) within each group that sufficient aggregation is achieved.
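As a rough illustration of the k-anonymity idea: group the records by the columns an attacker might know (the quasi-identifiers) and look at the smallest group. The function names, column names and records here are assumptions made up for the sketch.

```python
from collections import Counter

def smallest_group(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values(), default=0)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every group contains at least k individuals."""
    return smallest_group(records, quasi_identifiers) >= k

# Hypothetical records, for illustration only.
people = [
    {"age_band": "30-39", "area": "N1", "condition": "asthma"},
    {"age_band": "30-39", "area": "N1", "condition": "diabetes"},
    {"age_band": "30-39", "area": "N1", "condition": "asthma"},
    {"age_band": "40-49", "area": "N1", "condition": "eczema"},
    {"age_band": "40-49", "area": "N1", "condition": "asthma"},
]

k_found = smallest_group(people, ["age_band", "area"])
```

Here the 40-49 group has two people, so the data is 2-anonymous on these two columns but not 3-anonymous.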

But what about practice?

k-anonymity works well when the database is low-dimensional – few columns, many rows. But if there are lots of columns this fails.
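A small sketch of why more columns hurt, using synthetic random data (the sizes, the seed and the helper name are arbitrary assumptions): every extra column splits the groups further, so the smallest group can only shrink or stay the same.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable
N_ROWS, N_COLS, N_VALUES = 1000, 8, 4

# Synthetic rows: each column takes one of N_VALUES values at random.
rows = [
    tuple(random.randrange(N_VALUES) for _ in range(N_COLS))
    for _ in range(N_ROWS)
]

def smallest_group_using(rows, n_columns):
    """Smallest group size when grouping on the first n_columns columns."""
    groups = Counter(row[:n_columns] for row in rows)
    return min(groups.values())

# Smallest group size as columns are added one at a time.
sizes = [smallest_group_using(rows, d) for d in range(1, N_COLS + 1)]
```

With 8 columns of 4 values there are 65,536 possible combinations for only 1,000 rows, so most rows end up in a group of their own.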

You can collapse some columns, sure, but then we must talk about the distance between the records! With many columns, the closest neighbour and the farthest neighbour may be almost the same distance away.

Example: Dr Foster medical dataset. It is aggregated, anonymised, to the last 2–3 digits of postcode. Licensed to people who can provide services, authorised organisations. The threat model is very different here. It’s not an active adversary in attack; it may be an honest but curious person.

Individuals want to give data for medical research but not have it be totally identifiable.

The degree of collapsing needed can render the data nearly useless. You run the risk of collapsing the big numbers along with the small ones. There is no magic k – it’s all contextual.

If you collapse too much – e.g. national statistics – you can answer some questions but not others.

Can you work out the best questions to answer with a dataset and collapse in that way?

But with open data the benefit comes from use you can’t predict – if you knew the best way it would be used you wouldn’t need to open, you could just give the data to the user!

Think of the census. Huge stats effort to get personal data ready for open release. Worth investment – lots of minds to aggregate appropriately.

Changes in the UK census method are coming: a 10% sample release rather than 5%, with some fields removed from it.

There are some licence conditions on the sample data, but very light touch. A reduced set of variables is published, which is also useful as a teaching/training dataset. Record swapping is applied too, to create uncertainty and noise.

So two ways: collapse data, or remove fields (columns).
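The two options can be sketched on a single record. The field names, the 10-year banding and the example values are all hypothetical choices for illustration, not anything a particular release uses.

```python
def collapse_age(age, width=10):
    """Collapse an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def collapse(record):
    """Return a copy of the record with the age collapsed into a band."""
    out = dict(record)
    out["age"] = collapse_age(record["age"])
    return out

def remove_fields(record, fields):
    """Return a copy of the record without the named fields (columns)."""
    return {k: v for k, v in record.items() if k not in fields}

# Hypothetical record, for illustration only.
person = {"age": 37, "postcode": "AB1 2CD", "occupation": "nurse"}

collapsed = collapse(person)                  # age becomes a 10-year band
reduced = remove_fields(person, {"postcode"})  # postcode column dropped
```

Collapsing keeps a coarser version of the information; removing a field loses it entirely, which is part of why the choice depends on what the data will be used for.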

Then we look at how sensitive the data/variables are.

e.g. abortion data…

Need to keep data useful but add perturbations. In the census there’s nothing super sensitive; there’s health but it’s self-assessed (e.g. ‘are you a carer’).

Lots of uncertainty and write-in answers – processing (transcription, translation etc.) is not always perfectly accurate.

Can try to add records to cover expected abnormalities.

Record swapping and other forms of perturbation – when you’ve done this you release samples of records, and final stats database.
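A deliberately simplified sketch of record swapping, assuming a random swap of one field between pairs of records. Real statistical disclosure control would normally choose the swapped pairs much more carefully (e.g. matching on other characteristics); the function name, the `proportion` parameter and the household data are assumptions for illustration.

```python
import random

def swap_records(records, field, proportion, seed=0):
    """Swap `field` between randomly chosen pairs of records.

    Roughly `proportion` of the records take part in a swap.
    The input list is left untouched; a modified copy is returned.
    """
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]
    n_pairs = int(len(records) * proportion) // 2
    chosen = rng.sample(range(len(records)), 2 * n_pairs)
    for i, j in zip(chosen[::2], chosen[1::2]):
        swapped[i][field], swapped[j][field] = swapped[j][field], swapped[i][field]
    return swapped

# Hypothetical households, for illustration only.
households = [
    {"id": i, "area": f"area_{i % 5}", "size": 1 + i % 3} for i in range(10)
]
swapped = swap_records(households, "area", proportion=0.4)
```

The totals per area are unchanged, so the aggregate statistics survive, but an attacker can no longer be sure that a given record really belongs to the area it is listed under.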

The Netflix database has been used in attacks on other things; noise was added to the Netflix data, but not enough.

Knowledge of information may be imperfect anyway – but even with that you can use Netflix set to identify people.

Two different datasets – with various data processes – knowledge is imperfect – that’s different from deliberate perturbation (Statistical Disclosure Control).

These methods can weaken the results but could still leave potential for attack.

We haven’t seen these datasets released for study by security researchers – we haven’t tested the anonymised set. To carry out SDC tests you need real data – can happen inside an organisation. External researchers are different.

Ethics committee tough – so it’s an internal process to audit as well as to anonymise; no real validation. No way for external experts to assist/audit?


How can you evaluate anonymisation techniques using a meta anonymised data set? There are guidelines you can use for this, depending on the risks in the data set.

How can these methods be tested in a privacy-friendly manner? There are ways to bring in external researchers to evaluate disclosure risk.

Datasets reproduced after perturbation may have some gaps, some info. Strongest test: you should not be able to find the same person in both the raw dataset and the processed one? But record swapping only applies to some proportion of records (?)

A practical attacker might be interested in one neighbourhood, gather data on that, and then get some set of attributes from that… depends on data how a practical attack would proceed.

The information-gathering attack has been tested by UKAN, predicated on a specific sort of knowledge, ‘level 1 response knowledge’: you know the person is in the data.

Attack could target a group rather than an individual

Underlying models always depend on whether or not you know for sure if a specific target individual is in the data.

What’s the context of this data? What other data is in the world or available? This informs attack modelling.

You can only simulate one attacker – not the set of several/all attackers…

In primary risk analysis you assume the attacker has exactly the data you have.

What threats are considered? ‘Can I identify an individual and find out some specific info about them?’
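That threat can be sketched as a simple matching attack, assuming the attacker knows the target is in the release and knows a few of their attributes. The function names, fields and records are hypothetical; a real risk analysis would model the attacker's knowledge far more carefully.

```python
def candidates(released, known):
    """Records in the release consistent with what the attacker knows."""
    return [
        r for r in released
        if all(r.get(field) == value for field, value in known.items())
    ]

def learn(released, known, target_field):
    """Return the target's value only if the match is unique, else None."""
    found = candidates(released, known)
    if len(found) == 1:
        return found[0][target_field]
    return None

# Hypothetical released records, for illustration only.
released = [
    {"age_band": "30-39", "area": "N1", "condition": "asthma"},
    {"age_band": "30-39", "area": "N1", "condition": "diabetes"},
    {"age_band": "40-49", "area": "N1", "condition": "eczema"},
]

unique_hit = learn(released, {"age_band": "40-49", "area": "N1"}, "condition")
ambiguous = learn(released, {"age_band": "30-39", "area": "N1"}, "condition")
```

The first query matches exactly one record, so the attacker learns the condition with certainty. The second matches two records, so the attacker is left with a guess rather than an identification.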

False positives… for some matches there’s greater certainty that the match is correct. If you have high-priority matches, you then retrain your model and repeat.

You can have a theoretical limit with regard to a specific attack model. That’s estimable accurately, but it depends on assumptions about the adversary.

Our original question: *if we have personally identifiable info, and want to release it as open data, without (much) potential for identification, what do we have to do?* (assuming it’s not about data where you want to be able to identify!)

You must decide your risk appetite

Is it different if you ask people ‘are you willing to contribute your dataset’? What if you get a subset of data, because just a few folks contribute? That changes the aggregation picture.

It’s different for every dataset! Data about abortion vs census type data vs what’s your favourite colour?

Netflix – you can stop putting more data out but you can NEVER retract data.

Netflix – the trouble in communicating this is: you may say, hey, my movie data, no worries, I’ll share. But people never thought that it would lead to inference of sexual identity! Really hard to explain the risks here.

Attack audits are about certainty of identification rather than probability of identification.

3 ah-has:

  • hard to do really good auditing because you need to bring an adversary researcher in house to test the anonymisation of data (without access to the original you can’t really test)
  • high levels of aggregation are needed on most data for them to be opened safely
  • consider releasing under a licence instead of as open data