@rayvoelker
Last active September 16, 2023 15:52
Hybrid Patron Pseudonymizer

HybridPatronPseudonymizer

The HybridPatronPseudonymizer is a Python class that implements a hybrid encryption scheme producing a deterministic ciphertext (the encrypted content) for a given plaintext and patron record. Hybrid encryption combines the benefits of symmetric and asymmetric encryption.

Introduction

Protecting the privacy of patrons is of utmost importance for institutions that handle sensitive data. In some cases, the responsibility of safeguarding this privacy is intensified by law, regulation, or internal institutional policy. In this context, the provided Python module aims to strike a balance between keeping patron data confidential and preserving its analytical utility.

Pseudonymization is a data protection technique where personally identifiable information (PII) fields within a data record are replaced with artificial identifiers or pseudonyms. This method prioritizes data subject privacy while also facilitating data analysis. With the right keys and credentials, specific elements of the data set can be decrypted for detailed scrutiny. This approach ensures the data's integrity and upholds the patrons' privacy.

Anonymization, on the other hand, is a more stringent data protection approach that permanently removes or alters personally identifiable information (PII) within a dataset. While this ensures the highest level of data subject privacy, it comes at a significant cost to data analysis utility. Once data is anonymized, the process is irreversible, eliminating any possibility of decrypting or retrieving the original information. Consequently, any nuanced insights or context tied to personal identifiers are irretrievably lost.

Within the context of data protection, the General Data Protection Regulation (GDPR), for example, references pseudonymization as a potential method to safeguard personal data. However, it's crucial to note that the GDPR neither mandates its usage nor recognizes it as a definitive means to ensure privacy. Pseudonymization, while valuable, does not guarantee protection from GDPR infringements.

It's essential to understand that even pseudonymized data remains within the realm of personal data as per the GDPR and many other regulations and laws. This categorization is because such data can be linked back to an individual when complemented with supplementary details.

As articulated by the GDPR:

"Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person."

While pseudonymization (as implemented by this Python module) serves as a beneficial tool to aid in data protection, relying on it alone might not be adequate for ensuring complete privacy or total adherence to laws or regulations such as the GDPR.

Potential Uses:

  1. Data Storage: Before saving patron data to external databases or to CSV, Excel, or similar files, use the encrypt method so that identifiers are stored in encrypted form.
  2. Analysis: For data analysis tasks, patron identifiers remain unique but cannot be linked to patrons directly. Should it be necessary, decryption is possible given the encrypted RSA private key and the password needed to unlock it; data at rest always remains encrypted.
  3. Trust: Trust is foundational in the relationship between patrons and institutions. By taking steps to be transparent and safeguard personal data through pseudonymization, institutions bolster this trust, ensuring that patrons feel secure in their engagements.
  4. Privacy Regulations: With the rise of data privacy regulations globally, such as the GDPR in the European Union, organizations have a responsibility to protect the privacy of their users. Pseudonymizing patron data aids libraries and similar institutions in adhering to these regulations.
  5. Maintaining both Data and Privacy Integrity: In the event of a data breach, pseudonymized data is far less valuable to attackers. This is because the essential personal identifiers have been replaced with encrypted values, making it difficult or infeasible to link the data to individual patrons without other pieces of information, such as the encrypted private key and the application secret. These other pieces of information are kept securely outside the scope of the data.

By using this module, institutions can maintain a high standard of data privacy while still accessing and using the data as needed.

Overview

The class revolves around the following key functionalities:

  1. Initialization: The constructor initializes the pseudonymizer with RSA private and public keys and requires an app_secret for some operations.
  2. RSA Key Pair Generation: A method is provided to generate an RSA key pair and save it to disk.
  3. Encryption: The class offers a method to encrypt data using AES-SIV (a symmetric encryption algorithm). The AES key is encrypted using the RSA public key.
  4. Decryption: The decryption method decrypts the provided data by decrypting the AES key using the RSA private key and then using that AES key to decrypt the actual data.
  5. Key Derivation: A private method derives a symmetric encryption key using PBKDF2 with the SHA256 hash function.

Detailed Breakdown

Initialization

The constructor (__init__) initializes the pseudonymizer with:

  • RSA private and public keys.
  • An app_secret which is mandatory for certain operations.

Key Pair Generation

The method generate_rsa_key_pair generates an RSA key pair and saves it to specified file paths.

Encryption

The encrypt method:

  1. Encodes the provided data to bytes if it is a string.
  2. Derives the AES key using the _derive_aes_key method, which uses the app_secret and patron_record as inputs.
  3. Encrypts the data using AES-SIV mode with the derived AES key.
  4. Encrypts the derived AES key using the RSA public key with OAEP padding.
  5. Base64-encodes both the encrypted AES key and the ciphertext before returning.

Decryption

The decrypt method decrypts the provided data by:

  1. Decrypting the AES key using the RSA private key.
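  2. Decrypting the actual data using the recovered AES key. The patron_record argument is accepted but not used here, because the AES key comes from the encrypted bundle rather than being re-derived.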

Key Derivation

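The _derive_aes_key method derives a symmetric encryption key using PBKDF2 with the SHA256 hash function. It concatenates the patron record's id and createdDate with the app_secret, uses the result as both the PBKDF2 password and salt, and runs 1000 iterations to produce a 32-byte key. This derived key is used as the AES key for symmetric encryption.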
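
As a rough illustration, the derivation can be sketched with the standard library alone. `hashlib.pbkdf2_hmac` implements the same PBKDF2-HMAC construction, so with identical inputs it should yield the same bytes as the class's `PBKDF2HMAC` call, though that equivalence has not been checked against the module here. The function name `derive_aes_key_sketch` and the `"example-secret"` value are hypothetical; the input layout mirrors the module source.

```python
import hashlib


def derive_aes_key_sketch(app_secret: str, patron_record: dict) -> bytes:
    """Derive a 32-byte key from the patron's id, createdDate, and app_secret."""
    combined = (
        str(patron_record["id"])
        + str(patron_record["createdDate"])
        + str(app_secret)
    ).encode()
    # The module passes the combined bytes as both the password and the salt.
    return hashlib.pbkdf2_hmac("sha256", combined, combined, 1000, dklen=32)


record_a = {"id": 2198439, "createdDate": "2017-03-28T01:10:11Z"}
record_b = {"id": 2198440, "createdDate": "2017-03-28T01:10:11Z"}

key_a = derive_aes_key_sketch("example-secret", record_a)
key_a_again = derive_aes_key_sketch("example-secret", record_a)
key_b = derive_aes_key_sketch("example-secret", record_b)

print(len(key_a))            # 32
print(key_a == key_a_again)  # True: same patron and secret, same key
print(key_a == key_b)        # False: a different patron id changes the key
```

Because the patron's id and createdDate are likely to be known to anyone holding the data set, the secrecy of the derived key appears to rest mainly on the app_secret being long and random.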

Collision Considerations

In encryption, collisions refer to different plaintexts producing the same ciphertext. The risk of collisions in this implementation is influenced by:

  1. AES-SIV Mode: Being deterministic, AES-SIV will produce the same ciphertext for the same plaintext and key. However, for different plaintexts or keys, the ciphertexts should be different.
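  2. Key Derivation: The AES key is derived from both the app_secret and the patron_record. As long as this combination is unique for each patron, the derived AES key should be unique to that patron.
  3. RSA Encryption: RSA encryption with OAEP padding is considered secure. OAEP is randomized, so the encrypted AES key differs on every call even when the AES key is the same (the sample output at the end of the script shows this). Two different AES keys producing the same RSA-encrypted output is highly improbable, but the encrypted_key field should not be treated as a stable identifier; only the ciphertext field is deterministic.

For the HybridPatronPseudonymizer to function as intended, it's crucial to ensure that the combination of app_secret and patron_record is unique for each patron and stays the same every time that patron's data is encrypted.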
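
One way to check this requirement before encrypting a batch is to look for patron records whose key-derivation inputs coincide. The sketch below is an assumption about how such a check might look, not part of the module; `find_duplicate_key_inputs` is a hypothetical helper, and it compares only the id and createdDate fields because the app_secret is the same for every record.

```python
from collections import defaultdict


def find_duplicate_key_inputs(patron_records: list[dict]) -> dict[str, list[int]]:
    """Group record positions by the id + createdDate string fed to key derivation.

    Returns only the groups with more than one record, i.e. records that
    would share a derived AES key under the same app_secret.
    """
    positions_by_input = defaultdict(list)
    for position, record in enumerate(patron_records):
        key_input = str(record["id"]) + str(record["createdDate"])
        positions_by_input[key_input].append(position)
    return {k: v for k, v in positions_by_input.items() if len(v) > 1}


records = [
    {"id": 2198439, "createdDate": "2017-03-28T01:10:11Z"},
    {"id": 2198440, "createdDate": "2018-01-05T09:30:00Z"},
    # Hypothetical clash, e.g. after merging records from two systems.
    {"id": 2198439, "createdDate": "2017-03-28T01:10:11Z"},
]

duplicates = find_duplicate_key_inputs(records)
print(duplicates)  # {'21984392017-03-28T01:10:11Z': [0, 2]}
```

An empty result means every record in the batch would get its own derived key.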

Pseudonymization and Data Analysis:

Pseudonymization doesn't hinder meaningful data analysis. Here's why:

1. Maintaining Data Relationships:

Even with the pseudonyms replacing actual identifiers, the relationships between data records remain. This intact relationship ensures that analyses, like determining the circulation of items, remain accurate.

2. Aggregated Analysis:

Pseudonymized data sets still hold value for aggregated analyses. For example, one can compute the average number of books borrowed per patron or ascertain the most popular genres.

3. Temporal Analysis:

Over time, the behavior and preferences of patrons can change. Pseudonymized data allows institutions to track these changes over periods without revealing individual identities.

4. Consistency of Pseudonyms:

For each real-world entity, pseudonyms remain consistent. This consistency, for instance, ensures that a patron borrowing multiple books will have the same pseudonym across all records, enabling individualized analysis without exposing the patron's identity.

5. Joining Multiple Data Sets:

If several data sets undergo pseudonymization using the same methodology and pseudonyms, they can be merged for a more extensive analysis. This combination can reveal insights like the correlation between book borrowing patterns and event attendance.
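
The points above can be illustrated with a small standard-library sketch. All rows are made up, and the `"pseudonym-A"` style values are placeholders standing in for whatever deterministic value is used as the pseudonym (for example, the base64 `ciphertext` of a barcode).

```python
from collections import Counter

# Hypothetical rows; "patron" holds a pseudonym, never the real identifier.
checkouts = [
    {"patron": "pseudonym-A", "genre": "mystery"},
    {"patron": "pseudonym-A", "genre": "science fiction"},
    {"patron": "pseudonym-B", "genre": "mystery"},
    {"patron": "pseudonym-A", "genre": "mystery"},
    {"patron": "pseudonym-C", "genre": "biography"},
    {"patron": "pseudonym-B", "genre": "mystery"},
]
event_attendance = [
    {"patron": "pseudonym-A", "event": "author talk"},
    {"patron": "pseudonym-C", "event": "book club"},
    {"patron": "pseudonym-A", "event": "book club"},
]

# Aggregated analysis: loans per patron and most popular genre.
loans_per_patron = Counter(row["patron"] for row in checkouts)
average_loans = sum(loans_per_patron.values()) / len(loans_per_patron)
genre_counts = Counter(row["genre"] for row in checkouts)

# Joining data sets: both use the same pseudonym for the same patron.
events_per_patron = Counter(row["patron"] for row in event_attendance)
joined = {
    patron: {"loans": loans, "events": events_per_patron.get(patron, 0)}
    for patron, loans in loans_per_patron.items()
}

print(average_loans)                # 2.0
print(genre_counts.most_common(1))  # [('mystery', 4)]
print(joined["pseudonym-A"])        # {'loans': 3, 'events': 2}
```

The join only works if both data sets were pseudonymized with the same keys, app_secret, and patron record fields, since any difference would give the same patron two unrelated pseudonyms.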

Limitations

  1. Key Management: The strength of any encryption system largely depends on the secure management of cryptographic keys and secrets. If the encrypted RSA private key or the app_secret were to be compromised, the security of all encrypted data would be at risk. Proper storage and backup strategies should be in place.

  2. Uniqueness of Inputs: For the HybridPatronPseudonymizer to function optimally and prevent potential collisions, it's imperative to ensure the combination of app_secret and patron_record is unique for each data subject. Failing to maintain this uniqueness and consistency may compromise the deterministic nature of the encryption and could lead to potential data ambiguities.

  3. Deterministic Nature: While the deterministic nature of AES-SIV encryption ensures the same plaintext and key produce the same ciphertext, it also means that anyone holding the pseudonymized data can see when two values are equal and how often each one occurs. Users should be aware of this exposure to frequency analysis when using the system for datasets with a lot of repeated entries.

  4. Performance Overhead: Hybrid encryption, by its nature, involves both symmetric and asymmetric encryption operations. This can introduce a performance overhead, especially when dealing with large datasets.

Conclusion

Pseudonymization aims to balance safeguarding data privacy with retaining analytical utility. Because personal identifiers are protected rather than removed, the data remains meaningful for deriving insights and analysis.

The HybridPatronPseudonymizer class provides a robust encryption mechanism that combines the efficiency of symmetric encryption with the key exchange benefits of asymmetric encryption. As long as the combination of app_secret and patron_record is unique for each data subject, the system should offer a low probability of collisions, making it suitable for most practical purposes.

import base64
import os

from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa, padding as rsa_padding
from cryptography.hazmat.primitives.ciphers.aead import AESSIV
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
from cryptography.hazmat.primitives import hashes


class HybridPatronPseudonymizer:
    def __init__(
        self,
        private_key_path='private_key.pem',
        public_key_path='public_key.pem',
        app_secret=None,
        rsa_key_password=None
    ):
        if app_secret is None:
            raise ValueError('Error: need app_secret to continue')
        self.app_secret = app_secret

        # If an RSA key password is provided, encode it, otherwise, set it to None
        self.rsa_key_password = rsa_key_password.encode() if rsa_key_password else None

        # Load RSA public key if it exists
        if os.path.exists(public_key_path):
            with open(public_key_path, 'rb') as f:
                self.rsa_public_key = serialization.load_pem_public_key(
                    f.read(), backend=default_backend()
                )
        else:
            raise ValueError("Public key file not found.")

        # Load RSA private key if it exists
        if os.path.exists(private_key_path):
            with open(private_key_path, 'rb') as f:
                self.rsa_private_key = serialization.load_pem_private_key(
                    f.read(), password=self.rsa_key_password, backend=default_backend()
                )
        else:
            self.rsa_private_key = None  # Set to None if private key file is missing

    @staticmethod
    def generate_rsa_key_pair(private_key_path='private_key.pem', public_key_path='public_key.pem', rsa_key_password=None):
        # Generate an RSA key pair
        rsa_private_key = rsa.generate_private_key(
            public_exponent=65537,
            key_size=2048,
            backend=default_backend()
        )
        rsa_public_key = rsa_private_key.public_key()

        # Serialize and save the private key
        encryption_algo = \
            serialization.BestAvailableEncryption(rsa_key_password.encode()) \
            if rsa_key_password \
            else serialization.NoEncryption()
        pem_private = rsa_private_key.private_bytes(
            encoding=serialization.Encoding.PEM,
            format=serialization.PrivateFormat.PKCS8,
            encryption_algorithm=encryption_algo
        )
        with open(private_key_path, 'wb') as f:
            f.write(pem_private)

        # Serialize and save the public key
        pem_public = rsa_public_key.public_bytes(
            encoding=serialization.Encoding.PEM,
            format=serialization.PublicFormat.SubjectPublicKeyInfo
        )
        with open(public_key_path, 'wb') as f:
            f.write(pem_public)

        print("RSA key pair generated and saved.")

    def _derive_aes_key(self, app_secret, patron_record):
        # Combine patron record's id and createdDate and app_secret
        combined_data = (
            str(patron_record["id"])
            + str(patron_record["createdDate"])
            + str(app_secret)
        ).encode()

        # Use PBKDF2 to derive the AES key
        # AES-128-SIV requires a key that's double the size of standard AES-128 ...
        # AES-128: Requires a 128-bit (or 16-byte) key.
        # AES-192: Requires a 192-bit (or 24-byte) key.
        # AES-256: Requires a 256-bit (or 32-byte) key.
        kdf = PBKDF2HMAC(
            algorithm=hashes.SHA256(),
            length=32,  # see above
            salt=combined_data,
            iterations=1000,
            backend=default_backend()
        )
        return kdf.derive(combined_data)

    def encrypt(self, data, patron_record):
        if isinstance(data, str):
            data = data.encode()

        # Derive the AES key using password (app_secret _and_ patron record)
        aes_key = self._derive_aes_key(self.app_secret, patron_record)

        # Encrypt the data using AES-SIV mode
        ciphertext = AESSIV(aes_key).encrypt(data, [])

        # Encrypt the derived AES key using RSA public key
        encrypted_aes_key = self.rsa_public_key.encrypt(
            aes_key,
            rsa_padding.OAEP(
                mgf=rsa_padding.MGF1(algorithm=hashes.SHA256()),
                algorithm=hashes.SHA256(),
                label=None
            )
        )

        # Base64-encode both the encrypted AES key and the ciphertext before returning
        return {
            'encrypted_key': base64.b64encode(encrypted_aes_key).decode(),
            'ciphertext': base64.b64encode(ciphertext).decode()
        }

    def decrypt(self, encrypted_data, patron_record):
        # Check if private key is loaded
        if not self.rsa_private_key:
            raise ValueError("Private key is missing. Decryption not possible.")

        # Base64-decode the encrypted AES key and ciphertext
        encrypted_aes_key = base64.b64decode(encrypted_data['encrypted_key'].encode())
        ciphertext = base64.b64decode(encrypted_data['ciphertext'].encode())

        # Decrypt the AES key using RSA private key
        aes_key = self.rsa_private_key.decrypt(
            encrypted_aes_key,
            rsa_padding.OAEP(
                mgf=rsa_padding.MGF1(algorithm=hashes.SHA256()),
                algorithm=hashes.SHA256(),
                label=None
            )
        )

        # Decrypt the data using AES-SIV
        plaintext = AESSIV(aes_key).decrypt(ciphertext, [])
        return plaintext


if __name__ == "__main__":
    # %%timeit
    # Below is a general use pattern for this module:

    # 1. Generate the RSA Key Pair (only run once to generate and save the keys)
    HybridPatronPseudonymizer.generate_rsa_key_pair(
        private_key_path='dummy-test-private_key.pem',
        public_key_path='dummy-test-public_key.pem',
        rsa_key_password="secret (don't really use this for the love of Cthulhu)"
    )

    # Load the public key and (optionally) the encrypted private key
    pseudonymizer = HybridPatronPseudonymizer(
        private_key_path='dummy-test-private_key.pem',
        public_key_path='dummy-test-public_key.pem',
        rsa_key_password="secret (don't really use this for the love of Cthulhu)",
        app_secret="again, pick something different if you enjoy the love of Cthulhu"
    )

    for i in range(2):
        patron_record = {"id": 2198439, "createdDate": "2017-03-28T01:10:11Z"}
        encrypted_data_bundle_name = pseudonymizer.encrypt("Chimperson, Chimpy H", patron_record)
        encrypted_data_bundle_barcode = pseudonymizer.encrypt("678999", patron_record)

        # 2. Decrypt Data
        # If the private key is encrypted, provide the password when creating the pseudonymizer instance
        try:
            decrypted_data1 = pseudonymizer.decrypt(encrypted_data_bundle_name, patron_record)
            decrypted_data2 = pseudonymizer.decrypt(encrypted_data_bundle_barcode, patron_record)
        except Exception as e:
            print(e)

        print(
            encrypted_data_bundle_name,
            encrypted_data_bundle_barcode,
            "---",
            decrypted_data1.decode(),
            decrypted_data2.decode(),
            sep="\n",
            end="\n\n\n"
        )

"""result:
RSA key pair generated and saved.
{
'encrypted_key': 'CzlozHw4JM1VAJ6X14yN60NgydIAPVUa12cUTzkbwgB4BTi3BhSEJ8agpGvV2x/9Bu76LgyTPMVkPanIvSNO/5nlZBhXunrppm0FEXjW10lSV5PksTHYYR9scF125yvvM13bn9qixddxMTTo9Xhj6MX0959cB2de313Dcx6fo94IW7RmBRVeXfNmJUSZGhZnqNenoLda2LztOmkY3yfqeWBvXpPaT2S5ouHBR2+soHIx93pACAe6AwEM2dHh/zlDQk3JF1fUteYEpLzIU1WPni0dkIPdzAA5e6IxYNGF8nQz26imb8wNIyvDCqm1ypKhkwdoGKPO0yQfA5flER+PRQ==',
'ciphertext': 'U6EtsvmZcFdy0NoRHThVt3yAafIkqUzqAS1n8nlZbzPSuyYR'
}
{
'encrypted_key': 'D2JejrxWdkb8jM3nAHx3KB1D3uVtC+fAN7aKnIvQwfKQ4bE1y0lCk7U6R3eFNiXWYQwhsYONRO31en35CTEcb4nuqvAfnL4eBSdYn4oelRw1D+HkXH1ZO+qHn0uYjBbUOswdnTyduzxy89VxAOoD7AC8C8EVqquqr/YyxyDVz6OXtw4AtdpEpP5ZnFYHRVkx2YY6fyl5BEU/VOjcI6Or77pA9sZgHiY8lCZD31CZsCzcHa3m/qfnfS7ZaWrf959bmlYLuCD1nSnDNzq+YTu7Dhwu4qvJpmohjGgunC8pBmTJT0V/tk8JpOCY78XUaMcLk2ooKdf1/rH7XNam/UdmwQ==',
'ciphertext': 'Aqma5Vj4DG5RnsOlOx72/9aFoS7bAw=='
}
---
Chimperson, Chimpy H
678999
{
'encrypted_key': 'kVdocyJ13C6wpu1Lx3kLbGs2Mz2KUlucXNsswROmhrfMvaWMzwsOZYGOWerXMBqrqfnlc2Qebtcga/rmbBdoV0/lzOHNapED2Fr0yjuduye/XcepV+ZTu0RCE2MmFcFKyU8RqgX26aRIUdiwpVTAkGa6MgC3Zib30U32zmf31tqGvA6/HO/+/H29Hn4rZTo+7ga8ZydVh9jZRAdWZV0+tBJC2uP9kxQ0iG9Bx0GvDVxbvc5AZmb+Ua7ksijmvXncoOUwc19DOP3E+f27G0PnL0gFbSJVl/RzWUSkk6ByGwO2Ps7poVe6S4D6tH8ESP1e9u0lZWlvB7OkNaMzCMdrVw==',
'ciphertext': 'U6EtsvmZcFdy0NoRHThVt3yAafIkqUzqAS1n8nlZbzPSuyYR'
}
{
'encrypted_key': 'LAogNnUXPS3s4OaTRGK7EpBIo1EoZUsOgL9GKH1GCvvRB4+MuADt0VS2UdWUPJiZn+1EyuG4KPgAhpDIn/6s9qV0xv9r7mg5DqB5fmZKMWMxnjFZezwd+UEKMKClNNCxYFmJAwnnXFGvJMGFsHkLElJ36PpesQyzyWyxpIFv+aWiBL2B5FsVW/ikPAoQYp8P+dTYXKq54LAx5SVTftOhLvK+FLQu70bixk6sN7nAgrnjfOfP7mxxYnkdIjCW5I5dwXb4KAaq5z1o6CsLkadmgtOO0R+8JhLCOjiK6JCHJTQwuX5AlR8n6zwBPFS4RbYYAjQmaBiYMFPUPmPISG5oiw==',
'ciphertext': 'Aqma5Vj4DG5RnsOlOx72/9aFoS7bAw=='
}
---
Chimperson, Chimpy H
678999
"""
@maahutch

Hi Ray, I saw you posted this on the code4lib listserv. One thing you may want to point out in your readme is that GDPR still considers pseudonymized data to be personally identifiable:

"Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person."

So just using your class wouldn't be sufficient to free a dataset from GDPR standards.

I looked at the code and it made sense to me but I haven't tried running it yet. You've probably done this already but you should try and get any implementation of cryptography reviewed by the cryptographic community. Code4lib has a lot of talented developers but I don't know how many cryptographers there are.

@rayvoelker
Author

Thanks so much for the input and suggestions! It's very much appreciated!

You've probably done this already but you should try and get any implementation of cryptography reviewed by the cryptographic community. Code4lib has a lot of talented developers but I don't know how many cryptographers there are.

I haven't done this yet, and I'm honestly not sure where I should start. If you have any suggestions, I'd love to hear them!

@rayvoelker
Author

Thanks again @maahutch! I revised the README.md file to be more accurate based on your very helpful input!
