http://en.wikipedia.org/wiki/Data_anonymization
- Encourage release of data while protecting individuals/organizations
- Enable wider set of consumers of data
- Think through why (and whether) the data should be anonymized. Example: the policy discussion on NYC taxi data
- Use of good hashing practices
- Hashing provides consistent size output - no information leakage from output size
- For security, hashing requires a salt (loosely, the hashing counterpart of an encryption key)
- Key stretching (iterate the hash many times); see the Stack Overflow discussion
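import hashlib  # text and salt are assumed to be str
digest = hashlib.sha256((text + salt).encode()).digest()
for _ in range(1, 5000):  # key stretching: re-hash repeatedly
    digest = hashlib.sha256(digest + (text + salt).encode()).digest()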
- Hash salt selection best practices. (Written for passwords, but mostly applies).
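The hand-rolled loop above can also be expressed with PBKDF2 from Python's standard library, which performs salted, iterated hashing in one call. This is a minimal sketch, not a vetted implementation: the function name `stretched_hash`, the iteration count, and the choice of one secret salt per data release are assumptions made for illustration.

```python
import hashlib
import secrets


def stretched_hash(text: str, salt: bytes, iterations: int = 100_000) -> str:
    """Return a hex digest of `text`, salted and key-stretched with PBKDF2-HMAC-SHA256."""
    digest = hashlib.pbkdf2_hmac("sha256", text.encode("utf-8"), salt, iterations)
    return digest.hex()


# Assumption: one random salt per data release, kept out of the published data.
salt = secrets.token_bytes(16)

a = stretched_hash("123-45-6789", salt)
b = stretched_hash("123-45-6789", salt)
c = stretched_hash("987-65-4321", salt)
# The same input and salt always give the same output, so joins across tables still work.
```

Every output is 64 hex characters regardless of the input length, which matches the "consistent size output" point above.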
- External mapping of unique identifiers
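- Map the first SSN seen to 1, the second to 2, and so on, rather than publishing hash(SSN) or enc(SSN)
- Randomize the assignment order so that an attacker who knows when or in what order records arrived cannot infer the mapping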
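A rough sketch of the external-mapping idea: each distinct identifier is replaced by a small integer, and the integers are assigned in shuffled order so they do not reveal arrival order. The helper name `build_pseudonym_map` and the sample SSNs are hypothetical; the mapping table itself would have to be kept private (or destroyed) for this to protect anyone.

```python
import random


def build_pseudonym_map(identifiers, rng=None):
    """Map each distinct identifier to an integer 1..n, assigned in random order."""
    rng = rng or random.SystemRandom()
    unique = list(dict.fromkeys(identifiers))  # de-duplicate, keeping first-seen order
    rng.shuffle(unique)                        # break the link to arrival order
    return {value: index for index, value in enumerate(unique, start=1)}


ssns = ["123-45-6789", "987-65-4321", "123-45-6789", "555-12-3456"]
mapping = build_pseudonym_map(ssns)
anonymized = [mapping[s] for s in ssns]
# Repeated identifiers receive the same pseudonym, so records can still be linked.
```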
- Be aware of inter-relationships among data.
- Example: IP addresses & subnet. Is the fact that two addresses are on the same/different subnet information you want to provide or hide?
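To make the subnet question concrete, the sketch below contrasts two treatments of an IP address: one hides whether two addresses share a subnet, the other deliberately preserves it. HMAC-SHA256 is used here as the keyed hash; the function names, the key, and the /24 prefix length are assumptions for illustration only.

```python
import hashlib
import hmac
import ipaddress


def _tag(key: bytes, data: bytes) -> str:
    """Keyed hash (HMAC-SHA256) of `data`, as a hex string."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def hide_subnet(ip: str, key: bytes) -> str:
    """Hash the whole address: shared subnets are no longer visible."""
    return _tag(key, ipaddress.ip_address(ip).packed)


def keep_subnet(ip: str, key: bytes, prefix_len: int = 24) -> str:
    """Hash the network and the full address separately: shared subnets stay visible."""
    iface = ipaddress.ip_interface(f"{ip}/{prefix_len}")
    network_part = _tag(key, iface.network.network_address.packed)
    host_part = _tag(key, iface.ip.packed)
    return f"{network_part}-{host_part}"


key = b"release-secret"  # hypothetical secret, never published
same_a = keep_subnet("192.168.0.14", key)
same_b = keep_subnet("192.168.0.200", key)
other = keep_subnet("192.168.1.14", key)
```

With `keep_subnet`, the part before the dash is identical for `same_a` and `same_b` and differs for `other`; with `hide_subnet`, no such relationship survives.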
Use encryption rather than hashing? Hashing, when used correctly, is recommended in preference to encryption. Encryption is reversible (with difficulty), and it can also reveal additional information about the underlying data, especially when the plaintexts differ in length. For example, with two unique identifiers, "Erich" and "Erich Morisse", the output of enc("Erich") is likely to be shorter than that of enc("Erich Morisse"), depending on the cipher and its padding. By contrast, the outputs of hash("Erich" + salt) and hash("Erich Morisse" + salt) will always be the same length. Contrary opinions here and here.
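The length argument can be illustrated with a few lines of Python. The standard library has no real cipher, so `toy_stream_encrypt` below is a deliberately simplistic XOR keystream that stands in for a length-preserving cipher mode; it is an assumption of this sketch and is not secure. Block ciphers with padding would instead leak length only in block-sized steps.

```python
import hashlib


def toy_stream_encrypt(plaintext: bytes, key: bytes) -> bytes:
    """Toy XOR stream cipher used only to illustrate length leakage. NOT secure."""
    keystream = b""
    counter = 0
    while len(keystream) < len(plaintext):
        keystream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


key = b"demo key"    # hypothetical values
salt = b"demo salt"

for name in ("Erich", "Erich Morisse"):
    ciphertext = toy_stream_encrypt(name.encode(), key)
    digest = hashlib.sha256(name.encode() + salt).digest()
    print(name, len(ciphertext), len(digest))
# Erich 5 32
# Erich Morisse 13 32
```

The ciphertext lengths track the input lengths, while both digests are 32 bytes.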
- Unaltered
- Obscured unique identifiers
- SSN -> XXX-XX-XXXX
- hash(text + salt)
- Partially obscured identifiers - requires understanding of data type
- Phone number: 212-555-1212 -> 212-XXX-XXXX
- Credit card number: 4111 1111 1111 1111 -> 4XXX XXXX XXXX XXXX
- IP address: 192.168.0.14 -> 192.168.XXX.XXX
- clearText + hash(obscuredText + salt)
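The partial-obscuring examples above might be sketched as follows. All function names are hypothetical, and the input formats are assumed to match the examples exactly (dash-separated phone numbers, space-separated card numbers, dotted IPv4 addresses); real data would need validation first.

```python
import hashlib


def mask_phone(phone: str) -> str:
    """212-555-1212 -> 212-XXX-XXXX (keeps only the area code)."""
    area = phone.split("-")[0]
    return f"{area}-XXX-XXXX"


def mask_card(card: str) -> str:
    """Keep the first digit, replace every other digit with X, keep the spacing."""
    return card[0] + "".join("X" if ch.isdigit() else ch for ch in card[1:])


def mask_ip(ip: str) -> str:
    """192.168.0.14 -> 192.168.XXX.XXX (keeps the first two octets)."""
    octets = ip.split(".")
    return ".".join(octets[:2] + ["XXX", "XXX"])


def clear_plus_hash(clear: str, obscured: str, salt: str) -> str:
    """clearText + hash(obscuredText + salt): the hidden part stays linkable."""
    digest = hashlib.sha256((obscured + salt).encode("utf-8")).hexdigest()
    return f"{clear}-{digest}"


print(mask_phone("212-555-1212"))        # 212-XXX-XXXX
print(mask_card("4111 1111 1111 1111"))  # 4XXX XXXX XXXX XXXX
print(mask_ip("192.168.0.14"))           # 192.168.XXX.XXX
```

Note that the hidden part of a phone number has only ten million possible values, so the hashed variant depends heavily on the salt staying secret.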
- Obscured value ranges
- preserve the range of the values
- preserve the distribution of the values
- Normalize the values (Issues in normalization)
newValue = (oldValue - minValue) * 100/(maxValue-minValue) /* new range 0 to 100, distribution preserved */
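The normalization formula above translates directly into a small function. This is a sketch under the assumption that the values are numeric and not all identical; the name `normalize` and the sample salaries are made up for illustration.

```python
def normalize(values, new_max=100.0):
    """Rescale values linearly to the range 0..new_max (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize: all values are identical")
    return [(v - lo) * new_max / (hi - lo) for v in values]


salaries = [30000, 45000, 60000, 90000]
print(normalize(salaries))  # [0.0, 25.0, 50.0, 100.0]
```

The relative spacing between values is unchanged, which is what preserves the shape of the distribution. The minimum and maximum always map to 0 and 100, so anyone who knows the true extremes can invert the scaling.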
- No process
- Flaws in process
- Data flaws causing information leakage
- See Laplace noise as a method of adding some "variety" to the data.
- Considerations with sample size and Laplace noise
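As a rough illustration of Laplace noise and why sample size matters, the sketch below adds noise to a count. The noise scale of 1/epsilon for a counting query follows the usual Laplace-mechanism convention from differential privacy; the function names, the epsilon value, and the counts are assumptions, and Python's `random` module is used only for demonstration (it is not suitable for real privacy guarantees).

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Add Laplace noise with scale 1/epsilon (sensitivity 1 for a counting query)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
for true_count in (10, 10_000):
    released = noisy_count(true_count, epsilon=0.5, rng=rng)
    relative_error = abs(released - true_count) / true_count
    print(true_count, round(released, 2), f"{relative_error:.2%}")
```

The noise has the same scale in both cases, so it can swamp a count of 10 while barely affecting a count of 10,000.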
- External flaws (application of external context)
- Adding external data experiment
- What level of anonymity is desired?
- Peer review of process
- Public expression/certification of level of anonymity
- NYC taxi data easily de-anonymized: "On Taxis and Rainbows"
- De-anonymized health records
- Differential Privacy aims to provide means to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying its records.
- Transactions on Data Privacy