StochasticPseudonymizer: Cryptographic Token Generation for Replacing Patron Identifiers

StochasticPseudonymizer is a Python class designed to pseudonymize the personally identifiable information (PII) of library patrons using the PBKDF2HMAC key-derivation function. It preserves patron privacy by introducing deliberate uncertainty through hash collisions, making it impossible to prove that any one patron exhibited a specific behavior in the past.

Stochastic refers to the property of being well-described by a random probability distribution (https://en.wikipedia.org/wiki/Stochastic).

Pseudonymization refers to the technique whereby personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms (https://en.wikipedia.org/wiki/Pseudonymization).

Background

For a detailed explanation of the aims and goals of this pseudonymizer, see the excellent deep-dive "When Hashes Collide," presented by Steve Gibson on Security Now! podcast episode #940. See the show notes (SN-940-Notes.pdf) and the audio/video of the episode (twit.tv/shows/security-now/episodes/940) for more details.

Hash functions play a foundational role in cryptographic systems and are excellent candidates for replacing personally identifiable information (PII) in data retained for statistical purposes, a technique called pseudonymization. A hash function processes an input (in this case, some piece of PII linked to a library patron) and returns a fixed-size string that typically appears random. The characteristics of an ideal cryptographic hash function include the following (briefly demonstrated after the list):

  • Deterministic Output: For the same input, the output (hash) is always consistent.
  • Rapid Computation: Efficient generation of hash codes for any input.
  • Pre-image Resistance: Regenerating the original input from its hash is computationally infeasible.
  • Sensitivity to Input Changes: Any alteration in the input should produce an uncorrelated and drastically different hash.
  • Collision Resistance: Finding two different inputs that produce the same output should be computationally challenging.
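
A quick demonstration of the deterministic-output and input-sensitivity properties (with hypothetical inputs): the same input always hashes identically, while a one-character change produces an unrelated digest.

import hashlib

print(hashlib.sha256(b"patron-001").hexdigest())
print(hashlib.sha256(b"patron-001").hexdigest())  # identical: deterministic
print(hashlib.sha256(b"patron-002").hexdigest())  # one character changed: unrelated digest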

An amazing property of the output of such hash functions is that, as Steve Gibson put it, "every single bit of that pseudo-random number output is independent of all of the other bits – it does not care about any of its neighbors – and given any input, it will have an equal chance of being a 0 or a 1. What that tells us is that there is no need, whatsoever, to use it all." Modern hash functions produce a massive number of unique values: the 256 bits of a SHA256 output give us 115*10^75 individual combinations, far more than we need. As it turns out, reducing the number of possible combinations enhances the privacy of patron activity recorded and stored in statistical datasets.
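
As a small illustration of that point (with a hypothetical input), truncating a digest simply keeps a prefix of its bits, each of which remains an unbiased coin flip:

import hashlib

full = hashlib.sha256(b"some-patron-pii").hexdigest()  # 64 hex characters = 256 bits
truncated = full[:8]                                   # keep only 32 of those bits
print(full, truncated)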

The Birthday Paradox illustrates the counterintuitive nature of hash collisions: even with a vast hash space, collisions occur far more often than intuition suggests. This leads to a very useful technique: reducing the number of bits, or "buckets," into which PII can be mapped. Deliberately increasing the probability of hash collisions within the patron population, while preserving the statistical utility of the data, further obscures the behavior of any specific individual patron.
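
A quick back-of-the-envelope check using the standard birthday-bound approximation (the bin count is borrowed from the example further below): with roughly 278 million bins, a 50% chance of at least one collision needs only about 19,600 items.

import math

def collision_probability(k, N):
    # Approximate probability of at least one collision when
    # hashing k items into N bins (birthday bound)
    return 1 - math.exp(-k * (k - 1) / (2 * N))

print(collision_probability(19_600, 277_948_468))  # ~0.50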

Strategy: Smaller Population Size

One way to introduce this ambiguity is to choose a number of bins (the "population size" of possible tokens) that is small relative to the full output space of the hash function. Thanks to the Birthday Paradox, collisions become likely well before the number of bins drops below the number of items being hashed, and the smaller the bin count, the more collisions you should expect.

Why is this useful?

  • Ambiguity: Multiple items can map to the same hashed value, making it difficult to determine the original item based on the hash alone.
  • Flexibility: By adjusting the population size, you can control the level of ambiguity, tailoring it to the desired level of privacy protection.
  • Statistical Utility: Even with introduced ambiguity, the data retains utility for statistical and analytical purposes.

Python Example

Here's a Python code snippet to compute the expected number of collisions based on the chosen population size:

def expected_collisions(k, N):
    """Compute the expected number of collisions when hashing k items into N bins.

    Parameters:
    - k: number of items being hashed
    - N: number of bins (possible hash values)

    Returns the expected number of collisions.
    """
    # Expected collisions = number of items minus the expected
    # number of occupied bins
    return k - N * (1 - (1 - 1/N)**k)

# Replace these values as needed
number_of_items = 450_000  # k
number_of_bins = 277_948_468  # N

# Compute the expected collisions
collisions = expected_collisions(number_of_items, number_of_bins)
print(f"Expected number of collisions: {collisions}")

Implementation

The StochasticPseudonymizer class embodies several of the principles discussed (a minimal sketch appears after the placeholder below):

  • Utilizes PBKDF2HMAC for hashing, producing a consistent, fixed-length output for any given input.
  • Incorporates a salt derived from multiple sources, adding an extra layer of security and ensuring distinct tokens even for identical PII values.
  • Allows an adjustable iteration count to increase the computational cost of hash generation, making brute-forcing more challenging.
  • Deliberately increases the probability of hash collisions within the group of data subjects by reducing the size of the token space.

class StochasticPseudonymizer:
    ...
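
The full implementation lives in the accompanying notebook (StochasticPseudonymizer.ipynb); the following is a minimal, hypothetical sketch of the same idea using Python's standard library (hashlib.pbkdf2_hmac in place of the cryptography package's PBKDF2HMAC). The parameter names, defaults, and salt recipe here are assumptions for illustration, not the notebook's actual code.

import hashlib

class StochasticPseudonymizer:
    """Sketch: derive a deliberately small token from patron PII."""

    def __init__(self, app_secret, iterations=100_000, token_bytes=4):
        # token_bytes sets the size of the token space (the "bins");
        # 4 bytes allows roughly 4.3 billion distinct tokens
        self.app_secret = app_secret
        self.iterations = iterations
        self.token_bytes = token_bytes

    def generate_token(self, pii, patron_record):
        # Salt combines the app secret with static patron record fields,
        # so the same PII yields different tokens for different patrons
        salt_material = "{}{}{}".format(
            self.app_secret, patron_record["id"], patron_record["createdDate"]
        )
        salt = hashlib.sha256(salt_material.encode()).digest()
        derived = hashlib.pbkdf2_hmac(
            "sha256", pii.encode(), salt, self.iterations
        )
        # Keep only a prefix of the derived key, shrinking the output
        # space and deliberately raising the collision probability
        return derived[:self.token_bytes].hex()

Truncating to whole bytes is only one way to shrink the token space; a real implementation might instead reduce the derived value modulo an arbitrary bin count, such as the 277,948,468 used in the earlier example.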

Usage

To use the StochasticPseudonymizer:

  1. Initialize the class with your application secret.
  2. Call the generate_token method, passing in the PII and patron record.

app_secret = "YOUR_SECRET_HERE"  # keep this value secret; it is required to derive tokens
pseudonymizer = StochasticPseudonymizer(app_secret)
token = pseudonymizer.generate_token(
    "Sample_PII",
    patron_record={"id": 12345, "createdDate": "2023-09-23"},
)
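
Because the output is deterministic, repeated calls with the same PII, patron record, and app secret return the same token, so the token can serve as a stable key in statistical datasets.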

Considerations and Limitations

While it's not possible to "unscramble" the hash, it is still possible (and required for this method to function) to combine the static patron record data, the app secret, and the PII to re-derive the token, revealing the patron's activity. It's therefore crucial to keep the app secret as secure as possible.
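
For example, a holder of the app secret could test a guessed PII value against a stored token (a sketch reusing the objects from the usage example above):

guess = pseudonymizer.generate_token(
    "Guessed_PII",
    patron_record={"id": 12345, "createdDate": "2023-09-23"},
)
if guess == token:
    # a match is probabilistic evidence, not proof: collisions are deliberate
    print("guessed PII maps to the stored token")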

Also, it's important to note that despite all efforts to obscure PII, this method may still generate output that is considered PII under various regulations, laws, or policies, because of the potential to combine it with additional information.

As articulated by the GDPR (Recital 26):

"Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person."

While pseudonymization (as implemented by this Python module) serves as a beneficial tool to aid in data protection, relying on it alone may not be adequate to ensure complete privacy or full adherence to laws and regulations such as the GDPR.
