What is Data Anonymity?
Why Data Anonymity?
Recommendations (up for debate)
- Encourage release of data while protecting individuals/organizations
- Enable wider set of consumers of data
- Think through why (and whether) the data should be anonymized. Example policy discussion on NYC taxi data
- Use good hashing practices
- Hashing produces fixed-size output, so the output size leaks no information about the input
- For security, hashing requires a salt (the hashing equivalent of an encryption key)
- Key stretching (iterate the hash many times); see Stack Overflow
import hashlib
digest = hashlib.sha256(text + salt).digest()  # text and salt are bytes
for _ in range(1, 5000):
    digest = hashlib.sha256(digest + text + salt).digest()
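The hashing recommendations above (fixed-size output, a salt, key stretching) can be sketched as one small function. This is only an illustration under assumptions: the function name `stretched_hash`, the `rounds` parameter, and the 16-byte salt length are made up for the example, and the identifiers are the ones used elsewhere on this page.

```python
import hashlib
import secrets


def stretched_hash(text: str, salt: bytes, rounds: int = 5000) -> str:
    """Salted SHA-256, hashed `rounds` times in total (key stretching)."""
    data = text.encode("utf-8")
    digest = hashlib.sha256(data + salt).digest()
    for _ in range(1, rounds):
        # Feed the previous digest back in together with the text and salt.
        digest = hashlib.sha256(digest + data + salt).digest()
    return digest.hex()


# Assumption: one salt is generated per data set, kept secret, and reused,
# so that equal identifiers still map to equal hashes within the data set.
salt = secrets.token_bytes(16)

short_id = stretched_hash("Erich", salt)
long_id = stretched_hash("Erich Morisse", salt)

# Both outputs are 64 hex characters, whatever the input length.
print(len(short_id), len(long_id))
```

If a standard construction is preferred over a hand-written loop, Python's `hashlib.pbkdf2_hmac` is one option; it is a different algorithm from the loop above, not a drop-in equivalent.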
Types of Anonymous Data
- Hash salt selection best practices. (Written for passwords, but mostly applicable here.)
- External mapping of unique identifiers
- The first SSN seen maps to 1, the second to 2, and so on, rather than to hash(SSN) or enc(SSN)
- Add some random ordering to reduce attacks that rely on a known arrival time or ordering
- Be aware of inter-relationships among data.
- Example: IP addresses & subnet. Is the fact that two addresses are on the same/different subnet information you want to provide or hide?
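A possible sketch of the external-mapping idea above, with the random ordering folded in: distinct identifiers are shuffled before labels are handed out, so a label says nothing about when an identifier was first seen. The function name, the `seed` parameter, and the sample SSNs are hypothetical and exist only for this example.

```python
import random


def build_external_mapping(identifiers, seed=None):
    """Map each distinct identifier to an integer label in shuffled order."""
    distinct = list(dict.fromkeys(identifiers))  # drop duplicates
    rng = random.Random(seed)  # seed is only for repeatable demos
    rng.shuffle(distinct)      # break the link between label and first-seen order
    return {value: label for label, value in enumerate(distinct, start=1)}


# Made-up, invalid SSNs used purely as sample data.
ssns = ["000-00-0001", "000-00-0002", "000-00-0001", "000-00-0003"]

mapping = build_external_mapping(ssns)
anonymized = [mapping[s] for s in ssns]

# Repeated identifiers share a label; the mapping itself must be stored
# separately from the released data, or discarded.
print(anonymized)
```

`random.Random` is not a cryptographic generator; for a real release, `secrets.SystemRandom()` offers the same `shuffle` method backed by the operating system's randomness source.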
Use encryption rather than hashing?
Hashing, when used correctly, is recommended in preference to encryption. Encryption is reversible (with difficulty), and it can also reveal additional information about the underlying data, especially if the texts are different lengths. For example, with two unique identifiers, "Erich" and "Erich Morisse", the output of enc("Erich") is likely to be shorter than that of enc("Erich Morisse"). By contrast, the outputs of hash("Erich" + salt) and hash("Erich Morisse" + salt) will be the same length. Contrary opinions here and here.
- Obscured unique identifiers
- SSN -> XXX-XX-XXXX
- hash(text + salt)
- Partially obscured identifiers - requires understanding of data type
- Phone number: 212-555-1212 -> 212-XXX-XXXX
- Credit card number: 4111 1111 1111 1111 -> 4XXX XXXX XXXX XXXX
- IP address: 192.168.0.14 -> 192.168.XXX.XXX
- clearText + hash(obscuredText + salt)
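The partial-obscuring examples above could be written roughly as follows. This is a sketch that assumes the exact input formats shown in the bullets (dash-separated phone number, dotted IPv4 address); the function names and the "-" separator in `clear_plus_hash` are choices made for this example.

```python
import hashlib


def mask_phone(phone: str) -> str:
    """Keep the area code: 212-555-1212 -> 212-XXX-XXXX."""
    area, _, _ = phone.split("-")
    return f"{area}-XXX-XXXX"


def mask_card(card: str) -> str:
    """Keep the first digit: 4111 1111 1111 1111 -> 4XXX XXXX XXXX XXXX."""
    return card[0] + "".join("X" if ch.isdigit() else ch for ch in card[1:])


def mask_ip(ip: str) -> str:
    """Keep the first two octets: 192.168.0.14 -> 192.168.XXX.XXX."""
    first, second, _, _ = ip.split(".")
    return f"{first}.{second}.XXX.XXX"


def clear_plus_hash(clear: str, obscured: str, salt: bytes) -> str:
    """clearText + hash(obscuredText + salt).

    The clear part stays readable; the hidden part becomes a salted hash,
    so distinct original values remain distinguishable.
    """
    digest = hashlib.sha256(obscured.encode("utf-8") + salt).hexdigest()
    return f"{clear}-{digest}"


print(mask_phone("212-555-1212"))
print(mask_card("4111 1111 1111 1111"))
print(mask_ip("192.168.0.14"))
print(clear_plus_hash("212", "555-1212", b"example-salt"))
```

Which part stays in the clear depends on the data type, as the list above notes. The hidden part of a phone number has few possible values, so the salt has to stay secret for the hash to hide anything.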
- Obscured value ranges
- preserve the range of the values
- preserve the distribution of the values
- Normalize the values (Issues in normalization)
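As a rough illustration of normalizing a value range, the sketch below rescales a list of numbers onto 0 to 100 with min-max scaling. The function name and the sample figures are invented for the example.

```python
def normalize_to_0_100(values):
    """Min-max scale values onto the range 0 to 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so the range is zero")
    return [(v - lo) * 100 / (hi - lo) for v in values]


salaries = [30_000, 45_000, 60_000, 90_000]  # made-up sample data
print(normalize_to_0_100(salaries))  # [0.0, 25.0, 50.0, 100.0]
```

The relative spacing of the values is unchanged, which is what preserves the distribution. The mapping is linear, so anyone who knows two of the original values can recover all of the others.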
newValue = (oldValue - minValue) * 100 / (maxValue - minValue) /* new range 0 to 100, distribution preserved */
Classes of Anonymity Flaws
- No process
- Flaws in process
- Data flaws causing information leakage
- External flaws (application of external context)
Processes for Data Set Anonymity
- What level of anonymity is desired?
- Peer review of process
- Public expression/certification of level of anonymity
Examples of Need