@datapluspeople
Last active August 17, 2019 20:59
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>GENERATING FATA</h1>\n",
"<h2>Fake dATA</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Intro</h2>\n",
"<h3>IBM Watson Data</h3>\n",
"\n",
"For many of the early workbooks here, we've stood on the shoulders of others. We simply imported a dataset that was created for the Watson HR Analytics work. "
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Age</th>\n",
" <th>Attrition</th>\n",
" <th>BusinessTravel</th>\n",
" <th>DailyRate</th>\n",
" <th>Department</th>\n",
" <th>DistanceFromHome</th>\n",
" <th>Education</th>\n",
" <th>EducationField</th>\n",
" <th>EmployeeCount</th>\n",
" <th>...</th>\n",
" <th>RelationshipSatisfaction</th>\n",
" <th>StandardHours</th>\n",
" <th>StockOptionLevel</th>\n",
" <th>TotalWorkingYears</th>\n",
" <th>TrainingTimesLastYear</th>\n",
" <th>WorkLifeBalance</th>\n",
" <th>YearsAtCompany</th>\n",
" <th>YearsInCurrentRole</th>\n",
" <th>YearsSinceLastPromotion</th>\n",
" <th>YearsWithCurrManager</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>41</td>\n",
" <td>Yes</td>\n",
" <td>Travel_Rarely</td>\n",
" <td>1102</td>\n",
" <td>Sales</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Life Sciences</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>80</td>\n",
" <td>0</td>\n",
" <td>8</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>6</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>49</td>\n",
" <td>No</td>\n",
" <td>Travel_Frequently</td>\n",
" <td>279</td>\n",
" <td>Research &amp; Development</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" <td>Life Sciences</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>4</td>\n",
" <td>80</td>\n",
" <td>1</td>\n",
" <td>10</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>10</td>\n",
" <td>7</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>37</td>\n",
" <td>Yes</td>\n",
" <td>Travel_Rarely</td>\n",
" <td>1373</td>\n",
" <td>Research &amp; Development</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>Other</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>2</td>\n",
" <td>80</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>33</td>\n",
" <td>No</td>\n",
" <td>Travel_Frequently</td>\n",
" <td>1392</td>\n",
" <td>Research &amp; Development</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>Life Sciences</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>3</td>\n",
" <td>80</td>\n",
" <td>0</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>8</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>27</td>\n",
" <td>No</td>\n",
" <td>Travel_Rarely</td>\n",
" <td>591</td>\n",
" <td>Research &amp; Development</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>Medical</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>4</td>\n",
" <td>80</td>\n",
" <td>1</td>\n",
" <td>6</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 36 columns</p>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Age Attrition BusinessTravel DailyRate \\\n",
"0 0 41 Yes Travel_Rarely 1102 \n",
"1 1 49 No Travel_Frequently 279 \n",
"2 2 37 Yes Travel_Rarely 1373 \n",
"3 3 33 No Travel_Frequently 1392 \n",
"4 4 27 No Travel_Rarely 591 \n",
"\n",
" Department DistanceFromHome Education EducationField \\\n",
"0 Sales 1 2 Life Sciences \n",
"1 Research & Development 8 1 Life Sciences \n",
"2 Research & Development 2 2 Other \n",
"3 Research & Development 3 4 Life Sciences \n",
"4 Research & Development 2 1 Medical \n",
"\n",
" EmployeeCount ... RelationshipSatisfaction StandardHours \\\n",
"0 1 ... 1 80 \n",
"1 1 ... 4 80 \n",
"2 1 ... 2 80 \n",
"3 1 ... 3 80 \n",
"4 1 ... 4 80 \n",
"\n",
" StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance \\\n",
"0 0 8 0 1 \n",
"1 1 10 3 3 \n",
"2 0 7 3 3 \n",
"3 0 8 3 3 \n",
"4 1 6 3 3 \n",
"\n",
" YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion \\\n",
"0 6 4 0 \n",
"1 10 7 1 \n",
"2 0 0 0 \n",
"3 8 7 3 \n",
"4 2 2 2 \n",
"\n",
" YearsWithCurrManager \n",
"0 5 \n",
"1 7 \n",
"2 0 \n",
"3 0 \n",
"4 2 \n",
"\n",
"[5 rows x 36 columns]"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# imports\n",
"import pandas as pd\n",
"\n",
"# updated 2019-08-13\n",
"# IBM has removed the file from their server\n",
"\n",
"# deptecated code\n",
"# read the file \n",
"# url = \"https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx\"\n",
"# empl_data = pd.read_excel(url)\n",
"\n",
"# read local file for demonstration\n",
"file = 'Dropbox/WFA/data/WA_Fn-UseC_-HR-Employee-Attrition.xlsx'\n",
"empl_data = pd.read_excel(file)\n",
"empl_data.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is great to get us started, and gave us a dataset that many others had used - in blog posts, Kaggle competitions, and otherwise. Now, we're ready for more and would like to generate our own dataset for continued development and exploration."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>pydbgen</h2>\n",
"<b>Random Database/Dataframe Generator</b>\n",
"\n",
"github: https://github.com/tirthajyoti/pydbgen\n",
"read the docs: https://pydbgen.readthedocs.io/en/latest/\n",
"\n",
"from pydbgen documentation:\n",
"</i>Often, beginners in SQL or data science struggle with the matter of easy access to a large sample database file (.DB or .sqlite) for practicing SQL commands. Would it not be great to have a simple tool or library to generate a large database with multiple tables, filled with data of one’s own choice?\n",
"\n",
"After all, databases break every now and then and it is safest to practice with a randomly generated one :-)</i>\n",
"\n",
"That sums it up very well - we need data to practice on, and in a safe way. <u>Especially</u> when we're dealing with PII and senstive data, as we are regularly in HR. It's so commonplace that some, unfortunately, are densensitized to the senstitive nature and requirements, and make a blunder posting to an S3 bin or a similiar, but disastrous mistake.\n",
"\n",
"Generating our own fake data protects us from ourselves. pydbgen allows us to do this very quickly, and generates very realistic data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Installing pydbgen</h3>\n",
"As of this writing in August 2019, pydbgen is not available on <code>conda</code> (my preferred installation method). \n",
"\n",
"On both Windows and Linux use <code>pip</code>."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"pip install pydbgen"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [],
"source": [
"# load pydbgen\n",
"\n",
"import pydbgen\n",
"from pydbgen import pydbgen"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [],
"source": [
"db = pydbgen.pydb()\n",
"\n",
"df = db.gen_dataframe(num=100, fields=['name', 'street_address', 'city', 'state', 'zipcode', 'country', 'company', 'job_title', 'phone', 'ssn', 'email', 'month', 'year', 'weekday', 'date', 'time', 'latitude', 'longitude', 'license_plate'], )"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>street_address</th>\n",
" <th>city</th>\n",
" <th>state</th>\n",
" <th>zipcode</th>\n",
" <th>country</th>\n",
" <th>company</th>\n",
" <th>job_title</th>\n",
" <th>phone-number</th>\n",
" <th>ssn</th>\n",
" <th>email</th>\n",
" <th>month</th>\n",
" <th>year</th>\n",
" <th>weekday</th>\n",
" <th>date</th>\n",
" <th>time</th>\n",
" <th>latitude</th>\n",
" <th>longitude</th>\n",
" <th>license-plate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>William Anderson</td>\n",
" <td>7953 Mandy Turnpike</td>\n",
" <td>Swain</td>\n",
" <td>Texas</td>\n",
" <td>78280</td>\n",
" <td>Myanmar</td>\n",
" <td>Brown-Vasquez</td>\n",
" <td>Scientific laboratory technician</td>\n",
" <td>352-368-1239</td>\n",
" <td>228-58-3135</td>\n",
" <td>WAnderson@datapluspeople.com</td>\n",
" <td>None</td>\n",
" <td>1999</td>\n",
" <td>Monday</td>\n",
" <td>2002-11-17</td>\n",
" <td>10:39:23</td>\n",
" <td>-6.9832325</td>\n",
" <td>-30.181752</td>\n",
" <td>BIH-274</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dawn Molina</td>\n",
" <td>554 Heather Turnpike Apt. 311</td>\n",
" <td>Pepin</td>\n",
" <td>Oklahoma</td>\n",
" <td>75571</td>\n",
" <td>Malaysia</td>\n",
" <td>Martinez, Thomas and Henry</td>\n",
" <td>Chartered accountant</td>\n",
" <td>245-361-8447</td>\n",
" <td>252-39-2457</td>\n",
" <td>Dawn.Molina@datapluspeople.com</td>\n",
" <td>None</td>\n",
" <td>1990</td>\n",
" <td>Friday</td>\n",
" <td>2015-10-26</td>\n",
" <td>01:48:58</td>\n",
" <td>41.9638895</td>\n",
" <td>-33.070358</td>\n",
" <td>EYV-268</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Timothy Alexander</td>\n",
" <td>6693 Donald Plain</td>\n",
" <td>Moore</td>\n",
" <td>Delaware</td>\n",
" <td>94146</td>\n",
" <td>Chad</td>\n",
" <td>Diaz-Bruce</td>\n",
" <td>Camera operator</td>\n",
" <td>701-463-6626</td>\n",
" <td>602-26-0601</td>\n",
" <td>Alexander_Timothy67@datapluspeople.com</td>\n",
" <td>None</td>\n",
" <td>1991</td>\n",
" <td>Saturday</td>\n",
" <td>2009-12-31</td>\n",
" <td>01:51:05</td>\n",
" <td>-46.888624</td>\n",
" <td>-32.441572</td>\n",
" <td>AAN-6293</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Bradley Walter</td>\n",
" <td>24543 Adams Fort</td>\n",
" <td>Sydney</td>\n",
" <td>Indiana</td>\n",
" <td>33266</td>\n",
" <td>Ecuador</td>\n",
" <td>Jackson-Lang</td>\n",
" <td>Company secretary</td>\n",
" <td>420-550-7054</td>\n",
" <td>563-67-3139</td>\n",
" <td>BradleyWalter94@datapluspeople.com</td>\n",
" <td>None</td>\n",
" <td>1970</td>\n",
" <td>Thursday</td>\n",
" <td>1979-02-10</td>\n",
" <td>22:24:59</td>\n",
" <td>-7.668391</td>\n",
" <td>-166.274743</td>\n",
" <td>8QSM719</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Daniel Allen</td>\n",
" <td>9189 Cynthia Ramp</td>\n",
" <td>Noblestown</td>\n",
" <td>Kentucky</td>\n",
" <td>76651</td>\n",
" <td>France</td>\n",
" <td>Glass PLC</td>\n",
" <td>Biochemist, clinical</td>\n",
" <td>538-078-0566</td>\n",
" <td>533-98-1206</td>\n",
" <td>Daniel.Allen@datapluspeople.com</td>\n",
" <td>None</td>\n",
" <td>1978</td>\n",
" <td>Sunday</td>\n",
" <td>1978-06-02</td>\n",
" <td>15:35:34</td>\n",
" <td>-24.511857</td>\n",
" <td>-35.220806</td>\n",
" <td>CTZ-3918</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name street_address city state \\\n",
"0 William Anderson 7953 Mandy Turnpike Swain Texas \n",
"1 Dawn Molina 554 Heather Turnpike Apt. 311 Pepin Oklahoma \n",
"2 Timothy Alexander 6693 Donald Plain Moore Delaware \n",
"3 Bradley Walter 24543 Adams Fort Sydney Indiana \n",
"4 Daniel Allen 9189 Cynthia Ramp Noblestown Kentucky \n",
"\n",
" zipcode country company \\\n",
"0 78280 Myanmar Brown-Vasquez \n",
"1 75571 Malaysia Martinez, Thomas and Henry \n",
"2 94146 Chad Diaz-Bruce \n",
"3 33266 Ecuador Jackson-Lang \n",
"4 76651 France Glass PLC \n",
"\n",
" job_title phone-number ssn \\\n",
"0 Scientific laboratory technician 352-368-1239 228-58-3135 \n",
"1 Chartered accountant 245-361-8447 252-39-2457 \n",
"2 Camera operator 701-463-6626 602-26-0601 \n",
"3 Company secretary 420-550-7054 563-67-3139 \n",
"4 Biochemist, clinical 538-078-0566 533-98-1206 \n",
"\n",
" email month year weekday date \\\n",
"0 WAnderson@datapluspeople.com None 1999 Monday 2002-11-17 \n",
"1 Dawn.Molina@datapluspeople.com None 1990 Friday 2015-10-26 \n",
"2 Alexander_Timothy67@datapluspeople.com None 1991 Saturday 2009-12-31 \n",
"3 BradleyWalter94@datapluspeople.com None 1970 Thursday 1979-02-10 \n",
"4 Daniel.Allen@datapluspeople.com None 1978 Sunday 1978-06-02 \n",
"\n",
" time latitude longitude license-plate \n",
"0 10:39:23 -6.9832325 -30.181752 BIH-274 \n",
"1 01:48:58 41.9638895 -33.070358 EYV-268 \n",
"2 01:51:05 -46.888624 -32.441572 AAN-6293 \n",
"3 22:24:59 -7.668391 -166.274743 8QSM719 \n",
"4 15:35:34 -24.511857 -35.220806 CTZ-3918 "
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>pydbgen Summary</h3>\n",
"Overall, pydbgen is a great, quick way to generate any amount of data. The limitation is primarily the data types that are supported currently. The fields shown above in this example are the extent of fields available as of this writing. These are a great start and for certain situations, these are more than you need. A field such as License Plate is a nice inclusion.\n",
"\n",
"The documentation a little lacking. For example, the documentation (as of this writing) does not mention the 'Domains.txt' file required to generate email addresses. The documentation, however, does point us to <code>Faker</code> - which <code>pydbgen</code> builds upon to generate the fata (fake data). We'll explore <code>Faker</code> in the next section."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Faker</h2>"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [],
"source": [
"from faker import Faker"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [],
"source": [
"fake = Faker()"
]
},
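{
"cell_type": "markdown",
"metadata": {},
"source": [
"An aside worth knowing: <code>Faker</code> can be seeded so the generated values are reproducible between runs. A minimal sketch, assuming a recent <code>Faker</code> release (older versions seed the instance instead, via <code>fake.seed()</code>):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# seed the class-level random generator for reproducible fake data\n",
"Faker.seed(42)\n",
"\n",
"# after seeding, repeated runs of this cell produce the same sequence of values\n",
"fake.name()"
]
},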
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Michael Morris'"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fake.name()"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'6928 Richard Fort Suite 784\\nEast Nicole, SC 52141'"
]
},
"execution_count": 132,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fake.address()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake_df = pd.DataFrame(columns = ['name', 'ssn'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"name_list = []\n",
"ssn_list = []\n",
"dob_list = []\n",
"address_list = []\n",
"city_list = []\n",
"state_list = []\n",
"country_list = []\n",
"postal_list = []\n",
"id_list = []\n",
"email_list = []\n",
"username_list = []\n",
"\n",
"\n",
"for i in range(1000):\n",
" name_list.append(fake.name())\n",
" ssn_list.append(fake.ssn())\n",
" dob_list.append(fake.date_of_birth())\n",
" address_list.append(fake.street_address())\n",
" city_list.append(fake.city())\n",
" state_list.append(fake.state_abbr())\n",
" country_list.append(fake.country_code())\n",
" postal_list.append(fake.postalcode())\n",
" email_list.append(fake.email())\n",
" id_list.append(fake.random_int())\n",
" username_list.append(fake.user_name())\n",
" \n",
" \n",
" \n",
"fake_df['name'] = name_list\n",
"fake_df['ssn'] = ssn_list\n",
"fake_df['dob'] = dob_list\n",
"fake_df['address'] = address_list\n",
"fake_df['city'] = city_list\n",
"fake_df['state'] = state_list\n",
"fake_df['country'] = country_list\n",
"fake_df['postal'] = postal_list\n",
"fake_df['id'] = id_list\n",
"fake_df['email'] = email_list\n",
"fake_df['username'] = username_list"
]
},
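{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above builds one list per column. An equivalent, more compact pattern (just a sketch of the same idea, using a subset of the fields) builds one record per person and constructs the DataFrame in a single call:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# build a list of records, one dict per fake person\n",
"records = [\n",
"    {'name': fake.name(), 'ssn': fake.ssn(), 'email': fake.email()}\n",
"    for _ in range(1000)\n",
"]\n",
"\n",
"alt_df = pd.DataFrame(records)\n",
"alt_df.head()"
]
},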
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Customizing with <code>Faker</code></h3>\n",
"\n",
"<code>Faker</code> allows for the creation of your own providers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from faker.providers import BaseProvider\n",
"import random\n",
"\n",
"# create the provider. The class name for Faker must be 'Provider'\n",
"class Provider(BaseProvider):\n",
" def gender(self):\n",
" num = random.randint(0,1)\n",
" if num == 0:\n",
" return 'Male'\n",
" else:\n",
" return 'Female'"
]
},
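{
"cell_type": "markdown",
"metadata": {},
"source": [
"A 50/50 split is fine for a demo, but real workforce data is rarely uniform. <code>random.choices</code> accepts weights, so a custom provider can sample from any distribution. (When a class is registered directly with <code>add_provider</code>, it does not have to be named <code>Provider</code>.) A sketch with made-up department weights:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a provider that samples departments with unequal (made-up) weights\n",
"class DeptProvider(BaseProvider):\n",
"    def department(self):\n",
"        return random.choices(\n",
"            ['Research & Development', 'Sales', 'Human Resources'],\n",
"            weights=[0.65, 0.30, 0.05]\n",
"        )[0]\n",
"\n",
"fake.add_provider(DeptProvider)\n",
"fake.department()"
]
},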
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake.add_provider(Provider)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake.gender()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# add this to the DataFrame\n",
"gender_list = []\n",
"\n",
"for i in range(1000):\n",
" gender_list.append(fake.gender())\n",
"\n",
"fake_df['gender'] = gender_list\n",
"\n",
"fake_df['gender'].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake_df.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# convert gender column to category\n",
"fake_df['gender'] = fake_df['gender'].astype('category')\n",
"\n",
"fake_df.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake_df['gender'].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#export the file\n",
"fake_df.to_csv('~/Downloads/FATA.csv')"
]
},
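{
"cell_type": "markdown",
"metadata": {},
"source": [
"One small note on the export: by default <code>to_csv</code> also writes the DataFrame index as an extra unnamed column. The <code>Unnamed: 0</code> column in the IBM file at the top of this notebook is the classic symptom. Passing <code>index=False</code> avoids it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# export without the index column\n",
"fake_df.to_csv('~/Downloads/FATA.csv', index=False)"
]
},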
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Conclusion</h2>\n",
"\n",
"This tutorial showed a few ways in which we can generate fake data - FATA - to allow us to continue to explore and analyze HR data. You could combine this to anonymize your real HR data, to be able to include names, ssn's, etc. all without compromising one of the most fundamental parts of working with HR data - privacy and respect of people's information."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}