@emorisse
Last active August 29, 2015 14:02

What is Data Anonymity?

http://en.wikipedia.org/wiki/Data_anonymization

Why Data Anonymity?

  • Encourage release of data while protecting individuals/organizations
  • Enable wider set of consumers of data
  • Think through why (and whether) the data should be anonymized. Example: the policy discussion on NYC taxi data.

Recommendations (up for debate)

  • Use of good hashing practices
    • Hashing provides consistent size output - no information leakage from output size
    • For security, hashing requires a secret salt (the hashing counterpart of an encryption key)
    • Use key stretching (iterate the hash many times); see Stack Overflow
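import hashlib
data = (text + salt).encode("utf-8")  # sha256 operates on bytes, not str
digest = hashlib.sha256(data).digest()
for _ in range(1, 5000):  # key stretching: re-hash many times
    digest = hashlib.sha256(digest + data).digest()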
  • Hash salt selection best practices. (Written for passwords, but most of it applies here.)
  • External mapping of unique identifiers
    • 1st SSN seen = 1, rather than hash(SSN) or enc(SSN)
    • assign the numbers in a partly random order, so that an attacker who knows when (or in what order) records arrived cannot reconstruct the mapping
  • Be aware of inter-relationships among data.
    • Example: IP addresses & subnet. Is the fact that two addresses are on the same/different subnet information you want to provide or hide?
  • Use hashing rather than encryption - Hashing, when used correctly, is recommended in preference to encryption. Encryption is reversible (with difficulty), and its output can also reveal additional information about the underlying data, especially if the texts are different lengths. For example, with two unique identifiers, "Erich" and "Erich Morisse", the output of enc("Erich") is likely to be shorter than that of enc("Erich Morisse"). By contrast, the output of hash("Erich" + salt) and of hash("Erich Morisse" + salt) will be the same length. Contrary opinions here and here.
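
Two of the recommendations above can be sketched in Python. This is a rough illustration rather than a vetted implementation: it swaps the hand-rolled stretching loop for the standard library's PBKDF2 (hashlib.pbkdf2_hmac), and the function names pseudonymize and external_mapping, the salt value, and the sample identifiers are all made up for the example.

```python
import hashlib
import random


def pseudonymize(text: str, salt: bytes, iterations: int = 5000) -> str:
    """Return a salted, stretched SHA-256 digest of text as a hex string."""
    digest = hashlib.pbkdf2_hmac("sha256", text.encode("utf-8"), salt, iterations)
    return digest.hex()


def external_mapping(identifiers, rng=None):
    """Map each distinct identifier to an arbitrary integer label.

    Labels are handed out in shuffled order, so a label does not reveal
    when its identifier was first seen.
    """
    if rng is None:
        rng = random.SystemRandom()
    distinct = list(dict.fromkeys(identifiers))  # de-duplicate, keep order
    labels = list(range(1, len(distinct) + 1))
    rng.shuffle(labels)
    return dict(zip(distinct, labels))


salt = b"example-salt-keep-secret"  # placeholder; a real salt must stay private
short_id = pseudonymize("Erich", salt)
long_id = pseudonymize("Erich Morisse", salt)
print(len(short_id), len(long_id))  # both digests have the same length

ssns = ["000-00-0001", "000-00-0002", "000-00-0001"]  # made-up values
mapping = external_mapping(ssns)
print(mapping)
```

The mapping table has to be stored outside the released data set; anyone holding it can reverse the anonymization.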

Types of Anonymous Data

  • Unaltered
  • Obscured unique identifiers
    • SSN -> XXX-XX-XXXX
    • hash(text + salt)
  • Partially obscured identifiers - requires understanding of data type
    • Phone number: 212-555-1212 -> 212-XXX-XXXX
    • Credit card number: 4111 1111 1111 1111 -> 4XXX XXXX XXXX XXXX
    • IP address: 192.168.0.14 -> 192.168.XXX.XXX
    • clearText + hash(obscuredText + salt)
  • Obscured value ranges
    • preserve the range of the values
    • preserve the distribution of the values
    • Normalize the values (Issues in normalization)
newValue = (oldValue - minValue) * 100/(maxValue-minValue) /* new range 0 to 100, distribution preserved */
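
A minimal sketch of two of the types above, assuming Python: a partially obscured IP address in the form clearText + hash(obscuredText + salt), and range normalization with a guard for the case where every value is identical. The names mask_ip and normalize, the 12-character digest truncation, and the salt are assumptions made for the example.

```python
import hashlib


def mask_ip(ip: str, salt: bytes, keep_octets: int = 2) -> str:
    """Keep the leading octets in clear text; replace the rest with a salted hash."""
    octets = ip.split(".")
    clear = ".".join(octets[:keep_octets])
    obscured = ".".join(octets[keep_octets:])
    digest = hashlib.sha256(obscured.encode("utf-8") + salt).hexdigest()
    return f"{clear}.{digest[:12]}"  # truncation is an arbitrary choice


def normalize(values, new_max=100.0):
    """Rescale values linearly to the range 0..new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range is zero")
    return [(v - lo) * new_max / (hi - lo) for v in values]


salt = b"example-salt-keep-secret"  # placeholder; a real salt must stay private
print(mask_ip("192.168.0.14", salt))  # starts with "192.168."
print(normalize([10, 20, 30]))        # [0.0, 50.0, 100.0]
```

Because only the trailing octets are hashed, two hosts with the same trailing octets in different subnets get the same hashed suffix, which is one of the inter-relationships in the data to decide about before release.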

Classes of Anonymity Flaws

  • No process
  • Flaws in process
  • Data flaws causing information leakage
  • External flaws (application of external context)
  • What level of anonymity is desired?

Processes for Data Set Anonymity

  • Peer review of process
  • Public expression/certification of level of anonymity

Examples of Need

Reading list
