@hannesdatta
Created March 6, 2024 10:34
Anonymizing usernames for web scraping projects
{
"cells": [
{
"cell_type": "markdown",
"id": "aa15a781",
"metadata": {},
"source": [
"# Anonymizing usernames for web scraping projects using hashes"
]
},
{
"cell_type": "markdown",
"id": "3a5f07f7",
"metadata": {},
"source": [
"\n",
"__Background__:\n",
"\n",
"In the context of data privacy, even usernames like `hannesd` are treated as sensitive information. This is because usernames can act as a digital fingerprint, potentially allowing someone to identify and track an individual across multiple platforms or datasets, such as finding their profiles on social media networks like Twitter.\n",
"\n",
"__Solution__:\n",
"\n",
"To safeguard privacy, we transform usernames into \"hashes.\" Hashing is a one-way process that takes an input (in this case, a username) and produces a fixed-size string of characters that appears random. The same input always produces the same output, so `hannesd` is always converted into the same hash and all records of one user can still be linked within the dataset. At the same time, there is no practical way to compute `hannesd` directly from its hash. This property is crucial for maintaining user anonymity.\n",
"\n",
"A hash on its own can still be broken by guessing: an attacker can hash a list of likely usernames (or use a precomputed lookup table, a so-called rainbow table) and compare the results with the hashes in the dataset. To protect against this, we employ a technique called \"salting.\" A salt is a secret, random string that we add to the username before hashing it. Because this notebook uses one salt for the whole project, the same username still always maps to the same hash, but nobody can reproduce or look up the hashes without knowing the salt. The salt must therefore be kept secret and known only to the system applying the hashing process."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "9b8f2cf0",
"metadata": {},
"outputs": [],
"source": [
"import hashlib\n",
"\n",
"def generate_hash(salt, input_string):\n",
"    # Convert the salt to bytes\n",
"    salt_bytes = salt.encode()\n",
"    # Prepend the salt to the input_string (username)\n",
"    salted_input = salt_bytes + input_string.encode()\n",
"    # Create a hash object, using a secure hash algorithm (here: SHA-256)\n",
"    hash_object = hashlib.sha256(salted_input)\n",
"    # Generate the hexadecimal representation of the digest\n",
"    hash_hex = hash_object.hexdigest()\n",
"    # Return only the hash; the salt stays secret and is not part of the output\n",
"    return hash_hex\n",
"\n",
"# Example usage:\n",
"# The salt should be a randomly generated value that is stored securely,\n",
"# such as in an environment variable (see\n",
"# https://tilburgsciencehub.com/configure/environment-variables/).\n",
"salt = \"yourPredefinedSalt\"\n",
"\n",
"username = \"exampleUsername\"\n",
"salted_hash = generate_hash(salt, username)\n"
]
},
{
"cell_type": "markdown",
"id": "fd5553a2",
"metadata": {},
"source": [
"You can now use the function `generate_hash` in your code."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
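
__Possible extension__:

The notebook hard-codes a placeholder salt and hashes a single username. As a minimal sketch of how this might look in a scraping project, the snippet below reads the salt from an environment variable and anonymizes a small list of scraped records. The variable name `USERNAME_HASH_SALT`, the helpers `load_salt` and `anonymize_records`, and the sample records are assumptions made for illustration; they are not part of the original notebook. `generate_hash` is repeated in compact form so that the snippet runs on its own.

```python
import hashlib
import os
import secrets


def generate_hash(salt, input_string):
    # Same salted SHA-256 scheme as in the notebook: salt first, then username
    salted_input = salt.encode() + input_string.encode()
    return hashlib.sha256(salted_input).hexdigest()


def load_salt(env_var="USERNAME_HASH_SALT"):
    # Hypothetical helper: read the secret salt from an environment variable
    salt = os.environ.get(env_var)
    if not salt:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return salt


def anonymize_records(records, salt, key="username"):
    # Return copies of the records in which the username is replaced by its hash
    return [{**record, key: generate_hash(salt, record[key])} for record in records]


# Demo with a throwaway random salt; in a real project, use load_salt() instead
# so that the same salt is reused across runs and hashes stay comparable.
demo_salt = secrets.token_hex(32)

# Made-up sample data standing in for scraped records
scraped = [
    {"username": "exampleUsername", "review": "Great product"},
    {"username": "exampleUsername", "review": "Would buy again"},
    {"username": "anotherUser", "review": "Too expensive"},
]

anonymized = anonymize_records(scraped, demo_salt)
for record in anonymized:
    print(record["username"][:12], record["review"])
```

The two records from `exampleUsername` end up with the same hash, so they can still be linked in the analysis, while the username itself no longer appears in the output. A value like the one produced by `secrets.token_hex(32)` could be generated once and then stored in the environment variable.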