@datapluspeople
Last active August 17, 2019 20:59
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>GENERATING FATA</h1>\n",
"<h2>Fake dATA</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Intro</h2>\n",
"<h3>IBM Watson Data</h3>\n",
"\n",
"For many of the early workbooks here, we've stood on the shoulders of others. We simply imported a dataset that was created for the Watson HR Analytics work. "
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Age</th>\n",
" <th>Attrition</th>\n",
" <th>BusinessTravel</th>\n",
" <th>DailyRate</th>\n",
" <th>Department</th>\n",
" <th>DistanceFromHome</th>\n",
" <th>Education</th>\n",
" <th>EducationField</th>\n",
" <th>EmployeeCount</th>\n",
" <th>...</th>\n",
" <th>RelationshipSatisfaction</th>\n",
" <th>StandardHours</th>\n",
" <th>StockOptionLevel</th>\n",
" <th>TotalWorkingYears</th>\n",
" <th>TrainingTimesLastYear</th>\n",
" <th>WorkLifeBalance</th>\n",
" <th>YearsAtCompany</th>\n",
" <th>YearsInCurrentRole</th>\n",
" <th>YearsSinceLastPromotion</th>\n",
" <th>YearsWithCurrManager</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>41</td>\n",
" <td>Yes</td>\n",
" <td>Travel_Rarely</td>\n",
" <td>1102</td>\n",
" <td>Sales</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Life Sciences</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>80</td>\n",
" <td>0</td>\n",
" <td>8</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>6</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>49</td>\n",
" <td>No</td>\n",
" <td>Travel_Frequently</td>\n",
" <td>279</td>\n",
" <td>Research &amp; Development</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" <td>Life Sciences</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>4</td>\n",
" <td>80</td>\n",
" <td>1</td>\n",
" <td>10</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>10</td>\n",
" <td>7</td>\n",
" <td>1</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>37</td>\n",
" <td>Yes</td>\n",
" <td>Travel_Rarely</td>\n",
" <td>1373</td>\n",
" <td>Research &amp; Development</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>Other</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>2</td>\n",
" <td>80</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>33</td>\n",
" <td>No</td>\n",
" <td>Travel_Frequently</td>\n",
" <td>1392</td>\n",
" <td>Research &amp; Development</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>Life Sciences</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>3</td>\n",
" <td>80</td>\n",
" <td>0</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>8</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>27</td>\n",
" <td>No</td>\n",
" <td>Travel_Rarely</td>\n",
" <td>591</td>\n",
" <td>Research &amp; Development</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>Medical</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>4</td>\n",
" <td>80</td>\n",
" <td>1</td>\n",
" <td>6</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 36 columns</p>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Age Attrition BusinessTravel DailyRate \\\n",
"0 0 41 Yes Travel_Rarely 1102 \n",
"1 1 49 No Travel_Frequently 279 \n",
"2 2 37 Yes Travel_Rarely 1373 \n",
"3 3 33 No Travel_Frequently 1392 \n",
"4 4 27 No Travel_Rarely 591 \n",
"\n",
" Department DistanceFromHome Education EducationField \\\n",
"0 Sales 1 2 Life Sciences \n",
"1 Research & Development 8 1 Life Sciences \n",
"2 Research & Development 2 2 Other \n",
"3 Research & Development 3 4 Life Sciences \n",
"4 Research & Development 2 1 Medical \n",
"\n",
" EmployeeCount ... RelationshipSatisfaction StandardHours \\\n",
"0 1 ... 1 80 \n",
"1 1 ... 4 80 \n",
"2 1 ... 2 80 \n",
"3 1 ... 3 80 \n",
"4 1 ... 4 80 \n",
"\n",
" StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance \\\n",
"0 0 8 0 1 \n",
"1 1 10 3 3 \n",
"2 0 7 3 3 \n",
"3 0 8 3 3 \n",
"4 1 6 3 3 \n",
"\n",
" YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion \\\n",
"0 6 4 0 \n",
"1 10 7 1 \n",
"2 0 0 0 \n",
"3 8 7 3 \n",
"4 2 2 2 \n",
"\n",
" YearsWithCurrManager \n",
"0 5 \n",
"1 7 \n",
"2 0 \n",
"3 0 \n",
"4 2 \n",
"\n",
"[5 rows x 36 columns]"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# imports\n",
"import pandas as pd\n",
"\n",
"# updated 2019-08-13\n",
"# IBM has removed the file from their server\n",
"\n",
"# deptecated code\n",
"# read the file \n",
"# url = \"https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx\"\n",
"# empl_data = pd.read_excel(url)\n",
"\n",
"# read local file for demonstration\n",
"file = 'Dropbox/WFA/data/WA_Fn-UseC_-HR-Employee-Attrition.xlsx'\n",
"empl_data = pd.read_excel(file)\n",
"empl_data.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is great to get us started, and gave us a dataset that many others had used - in blog posts, Kaggle competitions, and otherwise. Now, we're ready for more and would like to generate our own dataset for continued development and exploration."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>pydbgen</h2>\n",
"<b>Random Database/Dataframe Generator</b>\n",
"\n",
"github: https://github.com/tirthajyoti/pydbgen\n",
"read the docs: https://pydbgen.readthedocs.io/en/latest/\n",
"\n",
"from pydbgen documentation:\n",
"</i>Often, beginners in SQL or data science struggle with the matter of easy access to a large sample database file (.DB or .sqlite) for practicing SQL commands. Would it not be great to have a simple tool or library to generate a large database with multiple tables, filled with data of one’s own choice?\n",
"\n",
"After all, databases break every now and then and it is safest to practice with a randomly generated one :-)</i>\n",
"\n",
"That sums it up very well - we need data to practice on, and in a safe way. <u>Especially</u> when we're dealing with PII and senstive data, as we are regularly in HR. It's so commonplace that some, unfortunately, are densensitized to the senstitive nature and requirements, and make a blunder posting to an S3 bin or a similiar, but disastrous mistake.\n",
"\n",
"Generating our own fake data protects us from ourselves. pydbgen allows us to do this very quickly, and generates very realistic data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Installing pydbgen</h3>\n",
"As of this writing in August 2019, pydbgen is not available on <code>conda</code> (my preferred installation method). \n",
"\n",
"On both Windows and Linux use <code>pip</code>."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"pip install pydbgen"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [],
"source": [
"# load pydbgen\n",
"\n",
"import pydbgen\n",
"from pydbgen import pydbgen"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [],
"source": [
"db = pydbgen.pydb()\n",
"\n",
"df = db.gen_dataframe(num=100, fields=['name', 'street_address', 'city', 'state', 'zipcode', 'country', 'company', 'job_title', 'phone', 'ssn', 'email', 'month', 'year', 'weekday', 'date', 'time', 'latitude', 'longitude', 'license_plate'], )"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>street_address</th>\n",
" <th>city</th>\n",
" <th>state</th>\n",
" <th>zipcode</th>\n",
" <th>country</th>\n",
" <th>company</th>\n",
" <th>job_title</th>\n",
" <th>phone-number</th>\n",
" <th>ssn</th>\n",
" <th>email</th>\n",
" <th>month</th>\n",
" <th>year</th>\n",
" <th>weekday</th>\n",
" <th>date</th>\n",
" <th>time</th>\n",
" <th>latitude</th>\n",
" <th>longitude</th>\n",
" <th>license-plate</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>William Anderson</td>\n",
" <td>7953 Mandy Turnpike</td>\n",
" <td>Swain</td>\n",
" <td>Texas</td>\n",
" <td>78280</td>\n",
" <td>Myanmar</td>\n",
" <td>Brown-Vasquez</td>\n",
" <td>Scientific laboratory technician</td>\n",
" <td>352-368-1239</td>\n",
" <td>228-58-3135</td>\n",
" <td>WAnderson@datapluspeople.com</td>\n",
" <td>None</td>\n",
" <td>1999</td>\n",
" <td>Monday</td>\n",
" <td>2002-11-17</td>\n",
" <td>10:39:23</td>\n",
" <td>-6.9832325</td>\n",
" <td>-30.181752</td>\n",
" <td>BIH-274</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dawn Molina</td>\n",
" <td>554 Heather Turnpike Apt. 311</td>\n",
" <td>Pepin</td>\n",
" <td>Oklahoma</td>\n",
" <td>75571</td>\n",
" <td>Malaysia</td>\n",
" <td>Martinez, Thomas and Henry</td>\n",
" <td>Chartered accountant</td>\n",
" <td>245-361-8447</td>\n",
" <td>252-39-2457</td>\n",
" <td>Dawn.Molina@datapluspeople.com</td>\n",
" <td>None</td>\n",
" <td>1990</td>\n",
" <td>Friday</td>\n",
" <td>2015-10-26</td>\n",
" <td>01:48:58</td>\n",
" <td>41.9638895</td>\n",
" <td>-33.070358</td>\n",
" <td>EYV-268</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Timothy Alexander</td>\n",
" <td>6693 Donald Plain</td>\n",
" <td>Moore</td>\n",
" <td>Delaware</td>\n",
" <td>94146</td>\n",
" <td>Chad</td>\n",
" <td>Diaz-Bruce</td>\n",
" <td>Camera operator</td>\n",
" <td>701-463-6626</td>\n",
" <td>602-26-0601</td>\n",
" <td>Alexander_Timothy67@datapluspeople.com</td>\n",
" <td>None</td>\n",
" <td>1991</td>\n",
" <td>Saturday</td>\n",
" <td>2009-12-31</td>\n",
" <td>01:51:05</td>\n",
" <td>-46.888624</td>\n",
" <td>-32.441572</td>\n",
" <td>AAN-6293</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Bradley Walter</td>\n",
" <td>24543 Adams Fort</td>\n",
" <td>Sydney</td>\n",
" <td>Indiana</td>\n",
" <td>33266</td>\n",
" <td>Ecuador</td>\n",
" <td>Jackson-Lang</td>\n",
" <td>Company secretary</td>\n",
" <td>420-550-7054</td>\n",
" <td>563-67-3139</td>\n",
" <td>BradleyWalter94@datapluspeople.com</td>\n",
" <td>None</td>\n",
" <td>1970</td>\n",
" <td>Thursday</td>\n",
" <td>1979-02-10</td>\n",
" <td>22:24:59</td>\n",
" <td>-7.668391</td>\n",
" <td>-166.274743</td>\n",
" <td>8QSM719</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Daniel Allen</td>\n",
" <td>9189 Cynthia Ramp</td>\n",
" <td>Noblestown</td>\n",
" <td>Kentucky</td>\n",
" <td>76651</td>\n",
" <td>France</td>\n",
" <td>Glass PLC</td>\n",
" <td>Biochemist, clinical</td>\n",
" <td>538-078-0566</td>\n",
" <td>533-98-1206</td>\n",
" <td>Daniel.Allen@datapluspeople.com</td>\n",
" <td>None</td>\n",
" <td>1978</td>\n",
" <td>Sunday</td>\n",
" <td>1978-06-02</td>\n",
" <td>15:35:34</td>\n",
" <td>-24.511857</td>\n",
" <td>-35.220806</td>\n",
" <td>CTZ-3918</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name street_address city state \\\n",
"0 William Anderson 7953 Mandy Turnpike Swain Texas \n",
"1 Dawn Molina 554 Heather Turnpike Apt. 311 Pepin Oklahoma \n",
"2 Timothy Alexander 6693 Donald Plain Moore Delaware \n",
"3 Bradley Walter 24543 Adams Fort Sydney Indiana \n",
"4 Daniel Allen 9189 Cynthia Ramp Noblestown Kentucky \n",
"\n",
" zipcode country company \\\n",
"0 78280 Myanmar Brown-Vasquez \n",
"1 75571 Malaysia Martinez, Thomas and Henry \n",
"2 94146 Chad Diaz-Bruce \n",
"3 33266 Ecuador Jackson-Lang \n",
"4 76651 France Glass PLC \n",
"\n",
" job_title phone-number ssn \\\n",
"0 Scientific laboratory technician 352-368-1239 228-58-3135 \n",
"1 Chartered accountant 245-361-8447 252-39-2457 \n",
"2 Camera operator 701-463-6626 602-26-0601 \n",
"3 Company secretary 420-550-7054 563-67-3139 \n",
"4 Biochemist, clinical 538-078-0566 533-98-1206 \n",
"\n",
" email month year weekday date \\\n",
"0 WAnderson@datapluspeople.com None 1999 Monday 2002-11-17 \n",
"1 Dawn.Molina@datapluspeople.com None 1990 Friday 2015-10-26 \n",
"2 Alexander_Timothy67@datapluspeople.com None 1991 Saturday 2009-12-31 \n",
"3 BradleyWalter94@datapluspeople.com None 1970 Thursday 1979-02-10 \n",
"4 Daniel.Allen@datapluspeople.com None 1978 Sunday 1978-06-02 \n",
"\n",
" time latitude longitude license-plate \n",
"0 10:39:23 -6.9832325 -30.181752 BIH-274 \n",
"1 01:48:58 41.9638895 -33.070358 EYV-268 \n",
"2 01:51:05 -46.888624 -32.441572 AAN-6293 \n",
"3 22:24:59 -7.668391 -166.274743 8QSM719 \n",
"4 15:35:34 -24.511857 -35.220806 CTZ-3918 "
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>pydbgen Summary</h3>\n",
"Overall, pydbgen is a great, quick way to generate any amount of data. The limitation is primarily the data types that are supported currently. The fields shown above in this example are the extent of fields available as of this writing. These are a great start and for certain situations, these are more than you need. A field such as License Plate is a nice inclusion.\n",
"\n",
"The documentation a little lacking. For example, the documentation (as of this writing) does not mention the 'Domains.txt' file required to generate email addresses. The documentation, however, does point us to <code>Faker</code> - which <code>pydbgen</code> builds upon to generate the fata (fake data). We'll explore <code>Faker</code> in the next section."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Faker</h2>"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [],
"source": [
"from faker import Faker"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [],
"source": [
"fake = Faker()"
]
},
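{
"cell_type": "markdown",
"metadata": {},
"source": [
"An aside worth knowing: <code>Faker</code> can be seeded so the generated values are reproducible between runs. A minimal sketch, assuming a recent <code>Faker</code> release (older versions seed the instance instead, via <code>fake.seed()</code>):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# seed the class-level random generator for reproducible fake data\n",
"Faker.seed(42)\n",
"\n",
"# after seeding, repeated runs of this cell produce the same sequence of values\n",
"fake.name()"
]
},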
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Michael Morris'"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fake.name()"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'6928 Richard Fort Suite 784\\nEast Nicole, SC 52141'"
]
},
"execution_count": 132,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fake.address()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake_df = pd.DataFrame(columns = ['name', 'ssn'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"name_list = []\n",
"ssn_list = []\n",
"dob_list = []\n",
"address_list = []\n",
"city_list = []\n",
"state_list = []\n",
"country_list = []\n",
"postal_list = []\n",
"id_list = []\n",
"email_list = []\n",
"username_list = []\n",
"\n",
"\n",
"for i in range(1000):\n",
" name_list.append(fake.name())\n",
" ssn_list.append(fake.ssn())\n",
" dob_list.append(fake.date_of_birth())\n",
" address_list.append(fake.street_address())\n",
" city_list.append(fake.city())\n",
" state_list.append(fake.state_abbr())\n",
" country_list.append(fake.country_code())\n",
" postal_list.append(fake.postalcode())\n",
" email_list.append(fake.email())\n",
" id_list.append(fake.random_int())\n",
" username_list.append(fake.user_name())\n",
" \n",
" \n",
" \n",
"fake_df['name'] = name_list\n",
"fake_df['ssn'] = ssn_list\n",
"fake_df['dob'] = dob_list\n",
"fake_df['address'] = address_list\n",
"fake_df['city'] = city_list\n",
"fake_df['state'] = state_list\n",
"fake_df['country'] = country_list\n",
"fake_df['postal'] = postal_list\n",
"fake_df['id'] = id_list\n",
"fake_df['email'] = email_list\n",
"fake_df['username'] = username_list"
]
},
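{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above builds one list per column. An equivalent, more compact pattern (just a sketch of the same idea, using a subset of the fields) builds one record per person and constructs the DataFrame in a single call:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# build a list of records, one dict per fake person\n",
"records = [\n",
"    {'name': fake.name(), 'ssn': fake.ssn(), 'email': fake.email()}\n",
"    for _ in range(1000)\n",
"]\n",
"\n",
"alt_df = pd.DataFrame(records)\n",
"alt_df.head()"
]
},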
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Customizing with <code>Faker</code></h3>\n",
"\n",
"<code>Faker</code> allows for the creation of your own providers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from faker.providers import BaseProvider\n",
"import random\n",
"\n",
"# create the provider. The class name for Faker must be 'Provider'\n",
"class Provider(BaseProvider):\n",
" def gender(self):\n",
" num = random.randint(0,1)\n",
" if num == 0:\n",
" return 'Male'\n",
" else:\n",
" return 'Female'"
]
},
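{
"cell_type": "markdown",
"metadata": {},
"source": [
"A 50/50 split is fine for a demo, but real workforce data is rarely uniform. <code>random.choices</code> accepts weights, so a custom provider can sample from any distribution. (When a class is registered directly with <code>add_provider</code>, it does not have to be named <code>Provider</code>.) A sketch with made-up department weights:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a provider that samples departments with unequal (made-up) weights\n",
"class DeptProvider(BaseProvider):\n",
"    def department(self):\n",
"        return random.choices(\n",
"            ['Research & Development', 'Sales', 'Human Resources'],\n",
"            weights=[0.65, 0.30, 0.05]\n",
"        )[0]\n",
"\n",
"fake.add_provider(DeptProvider)\n",
"fake.department()"
]
},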
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake.add_provider(Provider)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake.gender()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# add this to the DataFrame\n",
"gender_list = []\n",
"\n",
"for i in range(1000):\n",
" gender_list.append(fake.gender())\n",
"\n",
"fake_df['gender'] = gender_list\n",
"\n",
"fake_df['gender'].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake_df.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# convert gender column to category\n",
"fake_df['gender'] = fake_df['gender'].astype('category')\n",
"\n",
"fake_df.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fake_df['gender'].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#export the file\n",
"fake_df.to_csv('~/Downloads/FATA.csv')"
]
},
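{
"cell_type": "markdown",
"metadata": {},
"source": [
"One small note on the export: by default <code>to_csv</code> also writes the DataFrame index as an extra unnamed column. The <code>Unnamed: 0</code> column in the IBM file at the top of this notebook is the classic symptom. Passing <code>index=False</code> avoids it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# export without the index column\n",
"fake_df.to_csv('~/Downloads/FATA.csv', index=False)"
]
},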
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Conclusion</h2>\n",
"\n",
"This tutorial showed a few ways in which we can generate fake data - FATA - to allow us to continue to explore and analyze HR data. You could combine this to anonymize your real HR data, to be able to include names, ssn's, etc. all without compromising one of the most fundamental parts of working with HR data - privacy and respect of people's information."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}