@ludflu
Last active November 1, 2018 18:45
evolution of data privacy techniques

The evolution of privacy technology

  1. Data sanitizing - suppressing identifiers
  2. k-Anonymity (Sweeney & Samarati, 1998) - each individual in the dataset is indistinguishable from at least k-1 other individuals

In practice, it works by a combination of suppressing identifiers and bucketing values: https://en.wikipedia.org/wiki/K-anonymity

The k-Optimize algorithm by Bayardo and Agrawal (2005) searches for an optimal k-anonymization. It aims to perform the "lowest cost" anonymization, meaning it suppresses and aggregates data as little as possible in order to achieve the required "k". It is not great for high-dimensional datasets.
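The suppress-and-bucket idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the records, the bucket width, the zip truncation, and the helper names `generalize` and `is_k_anonymous` are all made up here and are not taken from any of the cited papers.

```python
from collections import Counter

# Hypothetical toy records: (name, age, zip_code, condition)
records = [
    ("Ann", 23, "13053", "flu"),
    ("Bob", 27, "13068", "cancer"),
    ("Cat", 31, "13053", "flu"),
    ("Dan", 36, "13068", "asthma"),
]

def generalize(record, age_bucket=10, zip_digits=3):
    """Suppress the name, bucket the age, and truncate the zip code."""
    _name, age, zip_code, condition = record
    low = (age // age_bucket) * age_bucket
    age_range = f"{low}-{low + age_bucket - 1}"
    zip_prefix = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return (age_range, zip_prefix, condition)

def is_k_anonymous(rows, k):
    """True if every quasi-identifier combination appears at least k times."""
    # The last field is the sensitive value, so it is left out of the grouping.
    groups = Counter(row[:-1] for row in rows)
    return all(count >= k for count in groups.values())

anonymized = [generalize(r) for r in records]
print(anonymized)
print(is_k_anonymous(anonymized, k=2))
```

With these toy rows, the bucketed ages and truncated zip codes leave two groups of two records each, so the table is 2-anonymous but not 3-anonymous.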

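Finding the optimal k-anonymization is a powerset search problem: candidate anonymizations correspond to subsets of the possible generalizations, so the search space grows exponentially.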
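To get a feel for why this is a powerset search, here is a deliberately naive brute-force sketch. It is not k-Optimize: it only considers suppressing whole columns, uses "number of suppressed columns" as a stand-in cost, and the data and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy rows of quasi-identifiers: (age_range, zip_code, sex)
rows = [
    ("20-29", "13053", "F"),
    ("20-29", "13053", "M"),
    ("30-39", "13068", "F"),
    ("30-39", "13068", "M"),
]

def suppress(row, columns):
    """Replace the values in the given column indexes with '*'."""
    return tuple("*" if i in columns else value for i, value in enumerate(row))

def is_k_anonymous(table, k):
    """True if every distinct row appears at least k times."""
    return all(count >= k for count in Counter(table).values())

def cheapest_suppression(table, k):
    """Try every subset of columns, smallest first, and return the first
    subset whose suppression makes the table k-anonymous."""
    n_cols = len(table[0])
    for size in range(n_cols + 1):
        for columns in combinations(range(n_cols), size):
            candidate = [suppress(row, columns) for row in table]
            if is_k_anonymous(candidate, k):
                return set(columns), candidate
    return None  # fewer than k rows: no suppression can help

columns, table = cheapest_suppression(rows, k=2)
print(columns)
print(table)
```

With n columns there are 2^n subsets to try, and real algorithms also choose among many possible bucketings per column, which is why pruning strategies matter.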

  3. l-Diversity - remedies the "homogeneity" attack, which is possible when sensitive values are not "well represented" within a group. Even if a dataset is k-anonymized, sensitive information can be discerned if one knows which group of k records a particular individual falls in and every record in that group shares the same sensitive value.

https://www.utdallas.edu/~muratk/courses/privacy08f_files/ldiversity.pdf

Homogeneity Attack: Alice and Bob are antagonistic
neighbors. One day Bob falls ill and is taken by ambulance
to the hospital. Having seen the ambulance, Alice sets out
to discover what disease Bob is suffering from. Alice discovers
the 4-anonymous table of current inpatient records
published by the hospital (Figure 2), and so she knows that
one of the records in this table contains Bob’s data. Since
Alice is Bob’s neighbor, she knows that Bob is a 31-year-old
American male who lives in the zip code 13053. Therefore,
Alice knows that Bob’s record number is 9,10,11, or 12.
Now, all of those patients have the same medical condition
(cancer), and so Alice concludes that Bob has cancer.

So, l-diversity ensures that, within each group of k individuals, sensitive values are well represented, so that Alice can't conclude that Bob has cancer, because several other conditions are also represented. Note - I think this means fudging the data!
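A rough way to check for this is to count the distinct sensitive values in each group. The sketch below assumes the simplest reading of "well represented" (at least l distinct values per group); the l-diversity paper also defines stricter variants. The rows are hypothetical toy data, not the paper's Figure 2, and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical 4-anonymous rows: (zip_prefix, age_range, condition)
rows = [
    ("130**", "30-39", "cancer"),
    ("130**", "30-39", "cancer"),
    ("130**", "30-39", "cancer"),
    ("130**", "30-39", "cancer"),
    ("148**", "40+", "cancer"),
    ("148**", "40+", "heart disease"),
    ("148**", "40+", "viral infection"),
    ("148**", "40+", "viral infection"),
]

def sensitive_values_by_group(table):
    """Map each quasi-identifier combination to its set of sensitive values."""
    groups = defaultdict(set)
    for *quasi_identifiers, sensitive in table:
        groups[tuple(quasi_identifiers)].add(sensitive)
    return groups

def is_distinct_l_diverse(table, l):
    """True if every group contains at least l distinct sensitive values."""
    return all(len(values) >= l for values in sensitive_values_by_group(table).values())

def homogeneous_groups(table):
    """Groups open to the homogeneity attack: only one sensitive value."""
    return [g for g, values in sensitive_values_by_group(table).items() if len(values) == 1]

print(is_distinct_l_diverse(rows, l=2))
print(homogeneous_groups(rows))
```

Here the first group is 4-anonymous but every record in it says "cancer", so the table fails the 2-diversity check, which is exactly Alice's attack on Bob.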

  4. Differential Privacy