emorisse/ProblemsInDataAnonymity.md

## ProblemsInDataAnonymity.md

      
    What is Data Anonymity?

http://en.wikipedia.org/wiki/Data_anonymization
Why Data Anonymity?


Encourage release of data while protecting individuals/organizations
Enable wider set of consumers of data
Think through why (and whether) the data should be anonymized. Example policy discussion on NYC taxi data

Recommendations (up for debate)


Use of good hashing practices

Hashing produces fixed-size output, so the output size leaks no information about the input.
For security, hashing requires a secret salt (loosely, the hashing counterpart of an encryption key).
Key stretching (iterating the hash many times) slows brute-force guessing; see Stack Overflow.


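import hashlib
data = (text + salt).encode("utf-8")   # text and salt are str values
digest = hashlib.sha256(data).digest()
for _ in range(1, 5000):               # key stretching: re-hash many times
    digest = hashlib.sha256(digest + data).digest()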

Hash salt selection best practices (written for passwords, but mostly applicable here).
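
The manual loop above can also be expressed with a standard key-derivation function. The following is a minimal sketch, assuming Python's `hashlib.pbkdf2_hmac` as a stand-in for the hand-rolled iteration (a swapped-in technique, not necessarily what the author had in mind). The function name `pseudonymize`, the iteration count, and the sample identifiers are illustrative assumptions.

```python
import hashlib
import secrets

def pseudonymize(text, salt, iterations=5000):
    """Return a salted, stretched SHA-256-based digest of text as a hex string."""
    derived = hashlib.pbkdf2_hmac("sha256", text.encode("utf-8"), salt, iterations)
    return derived.hex()

# Assumption: one secret salt per data set, stored apart from the released data.
salt = secrets.token_bytes(16)

a = pseudonymize("123-45-6789", salt)
b = pseudonymize("123-45-6789", salt)
c = pseudonymize("987-65-4321", salt)

# The same input maps to the same token; different inputs map to different tokens.
print(a == b, a == c, len(a))
```

Because the salt is reused across the data set, equal identifiers still map to equal tokens, which is what makes the output usable as a join key.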
External mapping of unique identifiers

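Map the first SSN seen to 1, the second to 2, and so on, rather than publishing hash(SSN) or enc(SSN).
Add some random ordering to the assignment, so that an attacker who knows the time or order in which records arrived cannot reconstruct the mapping.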
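
One way to sketch such an external mapping, under the assumption that all identifiers are available before release: collect the distinct values, shuffle them, and number them. The helper name `build_mapping` and the sample SSNs are hypothetical.

```python
import random

def build_mapping(identifiers):
    """Assign each distinct identifier a small integer, in shuffled order."""
    unique = list(dict.fromkeys(identifiers))  # distinct values, first-seen order
    random.SystemRandom().shuffle(unique)      # break the link to arrival order
    return {ident: n for n, ident in enumerate(unique, start=1)}

records = ["111-11-1111", "222-22-2222", "111-11-1111", "333-33-3333"]
mapping = build_mapping(records)               # keep this table private
released = [mapping[ssn] for ssn in records]   # publish only the surrogates
print(released)
```

The mapping table itself becomes the secret: it has to be stored outside the released data set, and anyone holding it can reverse the substitution.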


Be aware of inter-relationships among data.

Example: IP addresses & subnet. Is the fact that two addresses are on the same/different subnet information you want to provide or hide?
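
The subnet question can be made concrete with a small sketch. It assumes IPv4 addresses, a /24 subnet size, and a fixed example salt; the function names `hide_subnet` and `keep_subnet` are hypothetical. Hashing the whole address hides whether two hosts share a subnet, while hashing the network and host parts separately deliberately preserves that relationship.

```python
import hashlib
import ipaddress

SALT = b"example-salt"  # hypothetical; use a secret random salt in practice

def _token(data):
    """Salted SHA-256 digest, shortened to 12 hex characters for display only."""
    return hashlib.sha256(data + SALT).hexdigest()[:12]

def hide_subnet(addr):
    """Hash the whole address: same-subnet relationships are not visible."""
    return _token(ipaddress.ip_address(addr).packed)

def keep_subnet(addr, prefix_len=24):
    """Hash network and host separately: same-subnet hosts share a prefix token."""
    iface = ipaddress.ip_interface(f"{addr}/{prefix_len}")
    network_token = _token(iface.network.network_address.packed)
    host_token = _token(iface.ip.packed)
    return f"{network_token}:{host_token}"

for addr in ("192.168.0.14", "192.168.0.200", "192.168.1.14"):
    print(addr, hide_subnet(addr), keep_subnet(addr))
```

Which variant is appropriate depends on whether subnet membership is information the release is meant to provide or to hide.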


Use encryption rather than hashing?
Hashing, when used correctly, is recommended in preference to encryption. Encryption is reversible (though with difficulty), and its output can also reveal information about the underlying data, especially when the plaintexts differ in length. For example, given two unique identifiers, "Erich" and "Erich Morisse", the output of enc("Erich") is likely to be shorter than that of enc("Erich Morisse"). By contrast, the outputs of hash("Erich" + salt) and hash("Erich Morisse" + salt) have the same length. Contrary opinions here and here.
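
The fixed-length property of hash output is easy to check. This is a minimal sketch using SHA-256 from Python's standard library; the salt value and the helper name `salted_digest` are hypothetical.

```python
import hashlib

def salted_digest(text, salt):
    """Return the SHA-256 hex digest of text + salt."""
    return hashlib.sha256((text + salt).encode("utf-8")).hexdigest()

salt = "example-salt"  # hypothetical; use a secret random salt in practice
short = salted_digest("Erich", salt)
long_ = salted_digest("Erich Morisse", salt)

# Inputs of 5 and 13 characters both yield 64 hex characters of output.
print(len("Erich"), len(short))
print(len("Erich Morisse"), len(long_))
```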

Types of Anonymous Data


Unaltered
Obscured unique identifiers

SSN -> XXX-XX-XXXX
hash(text + salt)


Partially obscured identifiers - requires understanding of data type

Phone number:  212-555-1212 -> 212-XXX-XXXX
Credit card number: 4111 1111 1111 1111 -> 4XXX XXXX XXXX XXXX
IP address: 192.168.0.14 -> 192.168.XXX.XXX
clearText + hash(obscuredText + salt)
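
The partial-obscuring patterns above can be sketched as small helpers. The function names are hypothetical, and each one assumes its input already follows the exact format shown in the examples; real data would need validation first.

```python
import hashlib

def mask_phone(phone):
    """Keep the area code: 212-555-1212 -> 212-XXX-XXXX."""
    area = phone.split("-")[0]
    return f"{area}-XXX-XXXX"

def mask_card(number):
    """Keep the first digit: 4111 1111 1111 1111 -> 4XXX XXXX XXXX XXXX."""
    return number[0] + "".join("X" if ch.isdigit() else ch for ch in number[1:])

def mask_ip(addr):
    """Keep the first two octets: 192.168.0.14 -> 192.168.XXX.XXX."""
    first, second, _, _ = addr.split(".")
    return f"{first}.{second}.XXX.XXX"

def clear_plus_hash(clear, obscured, salt):
    """clearText + hash(obscuredText + salt): readable prefix, still a unique key."""
    digest = hashlib.sha256((obscured + salt).encode("utf-8")).hexdigest()
    return f"{clear}-{digest}"

print(mask_phone("212-555-1212"))
print(mask_card("4111 1111 1111 1111"))
print(mask_ip("192.168.0.14"))
print(clear_plus_hash("212", "555-1212", "example-salt"))
```

The last form keeps the clear-text part human-readable while the hashed part still distinguishes records that share it.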


Obscured value ranges

preserve the range of the values
preserve the distribution of the values
Normalize the values  (Issues in normalization)


newValue = (oldValue - minValue) * 100/(maxValue-minValue) /* new range 0 to 100, distribution preserved */
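
The normalization formula above, written as a runnable sketch. The function name `normalize` and the sample values are assumptions for illustration.

```python
def normalize(values, new_max=100.0):
    """Rescale values linearly to the range 0..new_max (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range is zero")
    return [(v - lo) * new_max / (hi - lo) for v in values]

salaries = [30000, 45000, 60000, 90000]
print(normalize(salaries))  # [0.0, 25.0, 50.0, 100.0]
```

The original minimum and maximum are hidden, but relative spacing is preserved, so anyone who learns two of the original values can recover the rest.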
Classes of Anonymity Flaws


No process
Flaws in process
Data flaws causing information leakage

See Laplace noise as a method of adding some "variety" to data.
Considerations with sample size and Laplace noise
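
A minimal sketch of adding Laplace noise, using only the Python standard library. It draws each sample as the difference of two exponential variates, which follows a Laplace distribution centred on zero. The scale of 5.0, the constant true value of 50.0, and the function names are illustrative assumptions; this does not by itself establish any formal privacy guarantee.

```python
import random
import statistics

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def add_noise(values, scale, rng):
    """Return a copy of values with independent Laplace noise added to each."""
    return [v + laplace_noise(scale, rng) for v in values]

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
for n in (10, 10000):
    released = add_noise([50.0] * n, scale=5.0, rng=rng)
    print(n, round(statistics.mean(released), 2))
```

With only a handful of records the noise can noticeably shift the average, while over a large sample it tends to cancel out. That is the sample-size consideration: the same noise scale costs far more accuracy on small groups.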


External flaws (application of external context)

Adding external data experiment


What level of anonymity is desired?

Processes for Data Set Anonymity


Peer review of process
Public expression/certification of level of anonymity

Examples of Need


NYC taxi data easily de-anonymized: On Taxis and Rainbows
De-anonymized health records

Reading list


Differential Privacy aims to provide means to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying its records.
Transactions on Data Privacy


The Cooperative Association for Internet Data Analysis


Tests and Strategies for concealing and identifying gender online