The `HybridPatronPseudonymizer` is a Python class that implements a hybrid encryption system producing a deterministic ciphertext (the encrypted content) for a given plaintext input. Hybrid encryption combines the benefits of both symmetric and asymmetric encryption.
Protecting the privacy of patrons is of utmost importance for institutions that handle sensitive data. In some cases, the responsibility of safeguarding this privacy is intensified by law, regulation, or internal institutional policy. In this context, the provided Python module aims to strike a balance between ensuring patron data confidentiality and preserving its analytical utility.
Pseudonymization is a data protection technique where personally identifiable information (PII) fields within a data record are replaced with artificial identifiers or pseudonyms. This method prioritizes data subject privacy while also facilitating data analysis. With the right keys and credentials, specific elements of the data set can be decrypted for detailed scrutiny. This approach ensures the data's integrity and upholds the patrons' privacy.
Anonymization, on the other hand, is a more stringent data protection approach that permanently removes or alters personally identifiable information (PII) within a dataset. While this ensures the highest level of data subject privacy, it comes at a significant cost to data analysis utility. Once data is anonymized, the process is irreversible, eliminating any possibility of decrypting or retrieving the original information. Consequently, any nuanced insights or context tied to personal identifiers are irretrievably lost.
Within the context of data protection, the General Data Protection Regulation (GDPR), for example, references pseudonymization as a potential method to safeguard personal data. However, it's crucial to note that the GDPR neither mandates its use nor recognizes it as a definitive means to ensure privacy. Pseudonymization, while valuable, does not guarantee protection from GDPR infringements.
It's essential to understand that even pseudonymized data remains within the realm of personal data as per the GDPR and many other regulations and laws. This categorization is because such data can be linked back to an individual when complemented with supplementary details.
As articulated by the GDPR:
"Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person."
While pseudonymization (as implemented by this Python module) is a beneficial tool to aid in data protection, relying on it alone might not be adequate for ensuring complete privacy or full adherence to laws and regulations such as the GDPR.
- Data Storage: Before saving patron data to external databases or files (CSV, Excel, etc.), use the `encrypt` method so that it is stored securely.
- Analysis: For data analysis tasks, patron identifiers remain unique but cannot be linked to patrons directly. Should it be necessary, decryption is possible given the encrypted RSA key and the password needed to unlock it; data at rest always remains encrypted.
- Trust: Trust is foundational in the relationship between patrons and institutions. By taking steps to be transparent and safeguard personal data through pseudonymization, institutions bolster this trust, ensuring that patrons feel secure in their engagements.
- Privacy Regulations: With the rise of data privacy regulations globally, such as the GDPR in the European Union, organizations have a responsibility to protect the privacy of their users. Pseudonymizing patron data aids libraries and similar institutions in adhering to these regulations.
- Maintaining both Data and Privacy Integrity: In the event of a data breach, pseudonymized data is far less valuable to attackers. This is because the essential personal identifiers have been replaced with encrypted values, making it difficult or impossible to link the data to individual patrons without other pieces of information, such as encrypted private keys and application secrets. These other pieces of information are kept securely outside the scope of the data.
By using this module, institutions can maintain a high standard of data privacy while still accessing and using the data as needed.
The class revolves around the following key functionalities:
- Initialization: The constructor initializes the pseudonymizer with RSA private and public keys and requires an `app_secret` for some operations.
- RSA Key Pair Generation: A method is provided to generate an RSA key pair and save it to disk.
- Encryption: The class offers a method to encrypt data using AES-SIV (a symmetric encryption algorithm). The AES key is encrypted using the RSA public key.
- Decryption: The decryption method decrypts the provided data by decrypting the AES key using the RSA private key and then using that AES key to decrypt the actual data.
- Key Derivation: A private method derives a symmetric encryption key using PBKDF2 with the SHA256 hash function.
The constructor (`__init__`) initializes the pseudonymizer with:
- RSA private and public keys.
- An `app_secret`, which is mandatory for certain operations.
The method `generate_rsa_key_pair` generates an RSA key pair and saves it to specified file paths.
The `encrypt` method:
- Encodes the provided data to bytes if it is a string.
- Derives the AES key using the `_derive_aes_key` method, which uses the `app_secret` and `patron_record` as inputs.
- Encrypts the data using AES-SIV mode with the derived AES key.
- Encrypts the derived AES key using the RSA public key with OAEP padding.
- Base64-encodes both the encrypted AES key and the ciphertext before returning.
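The final Base64 step can be sketched as follows. This is a minimal illustration, assuming the two values are handled as a pair of ASCII strings; the helper names `encode_for_storage` and `decode_from_storage` and the placeholder bytes are hypothetical and are not taken from the module.

```python
import base64


def encode_for_storage(encrypted_key: bytes, ciphertext: bytes) -> tuple[str, str]:
    """Turn two binary values into ASCII text that is safe for CSV or database columns."""
    return (
        base64.b64encode(encrypted_key).decode("ascii"),
        base64.b64encode(ciphertext).decode("ascii"),
    )


def decode_from_storage(encrypted_key_b64: str, ciphertext_b64: str) -> tuple[bytes, bytes]:
    """Reverse the encoding before any decryption step."""
    return base64.b64decode(encrypted_key_b64), base64.b64decode(ciphertext_b64)


# Placeholder bytes standing in for real RSA and AES-SIV output (assumption).
fake_encrypted_key = bytes(range(16))
fake_ciphertext = bytes(range(16, 48))

key_text, data_text = encode_for_storage(fake_encrypted_key, fake_ciphertext)
round_trip = decode_from_storage(key_text, data_text)
```

Base64 matters here because raw ciphertext is arbitrary binary data, which text formats such as CSV cannot hold reliably.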
The `decrypt` method decrypts the provided data by:
- Decrypting the AES key using the RSA private key.
- Decrypting the actual data using the recovered AES key.
The `_derive_key` method derives a symmetric encryption key using PBKDF2 with the SHA256 hash function. This derived key is used as the AES key for symmetric encryption.
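A minimal sketch of PBKDF2-SHA256 key derivation using only the standard library is shown here. It is not the module's code: the function name `derive_key_sketch`, the choice of `app_secret` as the password and `patron_record` as the salt, the iteration count, and the 64-byte key length are all assumptions made for illustration.

```python
import hashlib


def derive_key_sketch(app_secret: str, patron_record: str,
                      iterations: int = 100_000, key_len: int = 64) -> bytes:
    """Derive a symmetric key with PBKDF2-HMAC-SHA256.

    Assumption: app_secret acts as the password and patron_record as the salt.
    """
    return hashlib.pbkdf2_hmac(
        "sha256",
        app_secret.encode("utf-8"),
        patron_record.encode("utf-8"),
        iterations,
        dklen=key_len,
    )


# Same inputs always yield the same key; a different patron_record yields a different key.
key_a = derive_key_sketch("example-app-secret", "patron-0001")
key_b = derive_key_sketch("example-app-secret", "patron-0001")
key_c = derive_key_sketch("example-app-secret", "patron-0002")
```

Because the derivation is deterministic, the same inputs reproduce the same key on every run, which is what makes a stable pseudonym possible.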
In encryption, collisions refer to different plaintexts producing the same ciphertext. The risk of collisions in this implementation is influenced by:
- AES-SIV Mode: Being deterministic, AES-SIV will produce the same ciphertext for the same plaintext and key. However, for different plaintexts or keys, the ciphertexts should be different.
- Key Derivation: The AES key is derived from both the `app_secret` and the `patron_record`. As long as this combination is unique for each encryption, the derived AES key should be unique.
- RSA Encryption: RSA encryption with OAEP padding is considered secure. The encrypted AES key will only collide if two different AES keys produce the same RSA-encrypted output, which is highly improbable.
For the `HybridPatronPseudonymizer` to function as intended, it's crucial to ensure that the combination of `app_secret` and `patron_record` is unique for each encryption.
Pseudonymization doesn't hinder meaningful data analysis. Here's why:
Even with the pseudonyms replacing actual identifiers, the relationships between data records remain. This intact relationship ensures that analyses, like determining the circulation of items, remain accurate.
Pseudonymized data sets still hold value for aggregated analyses. For example, one can compute the average number of books borrowed per patron or ascertain the most popular genres.
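As a small illustration of aggregated analysis, the sketch below computes the average number of loans per patron and the most popular genre from records that contain only pseudonyms. The checkout rows, pseudonym strings, and field layout are made-up sample data, not output from the module.

```python
from collections import Counter
from statistics import mean

# Hypothetical checkout log: (patron pseudonym, genre). No real identifiers appear.
checkouts = [
    ("pseud-A1", "mystery"),
    ("pseud-A1", "mystery"),
    ("pseud-B2", "history"),
    ("pseud-B2", "mystery"),
    ("pseud-B2", "science"),
    ("pseud-C3", "history"),
]

# Count loans per pseudonym, then average across patrons.
loans_per_patron = Counter(pseudonym for pseudonym, _ in checkouts)
average_loans = mean(loans_per_patron.values())

# Find the most frequently borrowed genre.
genre_counts = Counter(genre for _, genre in checkouts)
most_popular_genre, genre_total = genre_counts.most_common(1)[0]
```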
Over time, the behavior and preferences of patrons can change. Pseudonymized data allows institutions to track these changes over periods without revealing individual identities.
For each real-world entity, pseudonyms remain consistent. This consistency, for instance, ensures that a patron borrowing multiple books will have the same pseudonym across all records, enabling individualized analysis without exposing the patron's identity.
If several data sets undergo pseudonymization using the same methodology and pseudonyms, they can be merged for a more extensive analysis. This combination can reveal insights like the correlation between book borrowing patterns and event attendance.
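The merge described above can be sketched with plain dictionaries. This assumes both data sets were pseudonymized with the same keys and secrets, so that one patron maps to the same pseudonym in each; the rows, field names, and the helper `merge_by_pseudonym` are hypothetical.

```python
from collections import defaultdict

# Hypothetical data sets that share the same pseudonyms.
loans = [
    {"patron": "pseud-A1", "item": "book-17"},
    {"patron": "pseud-A1", "item": "book-42"},
    {"patron": "pseud-B2", "item": "book-08"},
]
events = [
    {"patron": "pseud-A1", "event": "author-talk"},
    {"patron": "pseud-C3", "event": "story-time"},
]


def merge_by_pseudonym(loan_rows, event_rows):
    """Combine per-patron loan and event counts, keyed by pseudonym."""
    merged = defaultdict(lambda: {"loans": 0, "events": 0})
    for row in loan_rows:
        merged[row["patron"]]["loans"] += 1
    for row in event_rows:
        merged[row["patron"]]["events"] += 1
    return dict(merged)


activity = merge_by_pseudonym(loans, events)
```

If the two data sets had been pseudonymized with different secrets, the pseudonyms would not match and the join would silently produce separate rows for the same patron.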
- Key Management: The strength of any encryption system largely depends on the secure management of cryptographic keys and secrets. If the encrypted RSA private key or the `app_secret` were to be compromised, the security of all encrypted data would be at risk. Proper storage and backup strategies should be in place.
- Uniqueness of Inputs: For the `HybridPatronPseudonymizer` to function optimally and prevent potential collisions, it's imperative to ensure the combination of `app_secret` and `patron_record` is unique for each data subject. Failing to maintain this uniqueness and consistency may compromise the deterministic nature of the encryption and could lead to potential data ambiguities.
- Deterministic Nature: While the deterministic nature of AES-SIV encryption ensures the same plaintext produces the same ciphertext, it also means that repeated encryption of the same data can make the system more vulnerable to certain types of analysis over time. Users should be aware of this when using the system for datasets with a lot of repeated entries.
- Performance Overhead: Hybrid encryption, by its nature, involves both symmetric and asymmetric encryption operations. This can introduce a performance overhead, especially when dealing with large datasets.
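The deterministic-nature caveat in the list above can be illustrated with a short frequency count. The sketch uses HMAC-SHA256 as a stand-in for any deterministic keyed transformation; it is not AES-SIV and not the module's code, and it assumes every value is processed under the same key. The function `deterministic_token`, the key, and the branch names are hypothetical.

```python
import hashlib
import hmac
from collections import Counter


def deterministic_token(key: bytes, value: str) -> str:
    """Stand-in for deterministic ciphertext: the same key and value give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


demo_key = b"demo-key-not-for-production"

# A field with many repeated values, such as a patron's home branch.
branches = ["Main", "Main", "Main", "East", "East", "West"]
tokens = [deterministic_token(demo_key, branch) for branch in branches]

# Without knowing the key, an observer can still count how often each token appears.
token_counts = Counter(tokens)
most_common_token, occurrences = token_counts.most_common(1)[0]
```

An observer who already knows that "Main" is the busiest branch could infer which token stands for it from the counts alone, which is the kind of analysis the caveat warns about.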
Pseudonymization balances safeguarding data privacy with retaining analytical utility: personal identifiers are protected, while the data remains meaningful for deriving insights and analysis.
The `HybridPatronPseudonymizer` class provides a robust encryption mechanism that combines the efficiency of symmetric encryption with the key exchange benefits of asymmetric encryption. As long as the combination of `app_secret` and `patron_record` is unique for each data subject, the system should offer a low probability of collisions, making it suitable for most practical purposes.
Hi Ray, I saw you posted this on the code4lib listserv. One thing you may want to point out in your readme is that GDPR still considers pseudonymized data to be personally identifiable:
"Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person."
So just using your class wouldn't be sufficient to free a dataset from GDPR standards.
I looked at the code and it made sense to me but I haven't tried running it yet. You've probably done this already but you should try and get any implementation of cryptography reviewed by the cryptographic community. Code4lib has a lot of talented developers but I don't know how many cryptographers there are.