@tkhan0
Last active November 8, 2019 00:00
Data Cleaning.ipynb
{
"cells": [
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import pandas as pd",
"execution_count": 1,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "survey = \"/developer_survey_2019/survey_results_public.csv\"",
"execution_count": 2,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df = pd.read_csv(survey,header= 0)",
"execution_count": 3,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 1. Checking the shape of the dataset"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.shape",
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 4,
"data": {
"text/plain": "(88883, 85)"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.head()",
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 5,
"data": {
"text/plain": " Respondent MainBranch Hobbyist \\\n0 1 I am a student who is learning to code Yes \n1 2 I am a student who is learning to code No \n2 3 I am not primarily a developer, but I write co... Yes \n3 4 I am a developer by profession No \n4 5 I am a developer by profession Yes \n\n OpenSourcer \\\n0 Never \n1 Less than once per year \n2 Never \n3 Never \n4 Once a month or more often \n\n OpenSource \\\n0 The quality of OSS and closed source software ... \n1 The quality of OSS and closed source software ... \n2 The quality of OSS and closed source software ... \n3 The quality of OSS and closed source software ... \n4 OSS is, on average, of HIGHER quality than pro... \n\n Employment Country \\\n0 Not employed, and not looking for work United Kingdom \n1 Not employed, but looking for work Bosnia and Herzegovina \n2 Employed full-time Thailand \n3 Employed full-time United States \n4 Employed full-time Ukraine \n\n Student EdLevel \\\n0 No Primary/elementary school \n1 Yes, full-time Secondary school (e.g. American high school, G... \n2 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n3 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n4 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n\n UndergradMajor \\\n0 NaN \n1 NaN \n2 Web development or web design \n3 Computer science, computer engineering, or sof... \n4 Computer science, computer engineering, or sof... \n\n ... WelcomeChange \\\n0 ... Just as welcome now as I felt last year \n1 ... Just as welcome now as I felt last year \n2 ... Just as welcome now as I felt last year \n3 ... Just as welcome now as I felt last year \n4 ... Just as welcome now as I felt last year \n\n SONewContent Age Gender Trans \\\n0 Tech articles written by other developers;Indu... 14.0 Man No \n1 Tech articles written by other developers;Indu... 19.0 Man No \n2 Tech meetups or events in your area;Courses on... 28.0 Man No \n3 Tech articles written by other developers;Indu... 22.0 Man No \n4 Tech meetups or events in your area;Courses on... 
30.0 Man No \n\n Sexuality Ethnicity \\\n0 Straight / Heterosexual NaN \n1 Straight / Heterosexual NaN \n2 Straight / Heterosexual NaN \n3 Straight / Heterosexual White or of European descent \n4 Straight / Heterosexual White or of European descent;Multiracial \n\n Dependents SurveyLength SurveyEase \n0 No Appropriate in length Neither easy nor difficult \n1 No Appropriate in length Neither easy nor difficult \n2 Yes Appropriate in length Neither easy nor difficult \n3 No Appropriate in length Easy \n4 No Appropriate in length Easy \n\n[5 rows x 85 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Respondent</th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>...</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Age</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>I am a student who is learning to code</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Not employed, and not looking for work</td>\n <td>United Kingdom</td>\n <td>No</td>\n <td>Primary/elementary school</td>\n <td>NaN</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>14.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>I am a student who is learning to code</td>\n <td>No</td>\n <td>Less than once per year</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Not employed, but looking for work</td>\n <td>Bosnia and Herzegovina</td>\n <td>Yes, full-time</td>\n <td>Secondary school (e.g. 
American high school, G...</td>\n <td>NaN</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>19.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>I am not primarily a developer, but I write co...</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>Thailand</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Web development or web design</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech meetups or events in your area;Courses on...</td>\n <td>28.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>I am a developer by profession</td>\n <td>No</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>22.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Once a month or more often</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>Ukraine</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n 
<td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech meetups or events in your area;Courses on...</td>\n <td>30.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent;Multiracial</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 85 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 2. To print all the column names of the DataFrame, we'll use the df.columns command"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.columns",
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 6,
"data": {
"text/plain": "Index(['Respondent', 'MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource',\n 'Employment', 'Country', 'Student', 'EdLevel', 'UndergradMajor',\n 'EduOther', 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode',\n 'YearsCodePro', 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney',\n 'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz',\n 'JobFactors', 'ResumeUpdate', 'CurrencySymbol', 'CurrencyDesc',\n 'CompTotal', 'CompFreq', 'ConvertedComp', 'WorkWeekHrs', 'WorkPlan',\n 'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',\n 'CodeRevHrs', 'UnitTests', 'PurchaseHow', 'PurchaseWhat',\n 'LanguageWorkedWith', 'LanguageDesireNextYear', 'DatabaseWorkedWith',\n 'DatabaseDesireNextYear', 'PlatformWorkedWith',\n 'PlatformDesireNextYear', 'WebFrameWorkedWith',\n 'WebFrameDesireNextYear', 'MiscTechWorkedWith',\n 'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',\n 'BlockchainOrg', 'BlockchainIs', 'BetterLife', 'ITperson', 'OffOn',\n 'SocialMedia', 'Extraversion', 'ScreenName', 'SOVisit1st',\n 'SOVisitFreq', 'SOVisitTo', 'SOFindAnswer', 'SOTimeSaved',\n 'SOHowMuchTime', 'SOAccount', 'SOPartFreq', 'SOJobs', 'EntTeams',\n 'SOComm', 'WelcomeChange', 'SONewContent', 'Age', 'Gender', 'Trans',\n 'Sexuality', 'Ethnicity', 'Dependents', 'SurveyLength', 'SurveyEase'],\n dtype='object')"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 3.We can find the total number of rows using the following:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.index",
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 7,
"data": {
"text/plain": "RangeIndex(start=0, stop=88883, step=1)"
},
"metadata": {}
}
]
},
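`df.index` returns the index object rather than a plain number, so the row count has to be read off its `stop` value. A minimal sketch of more direct ways to get the count, assuming a small made-up frame (`toy` is a hypothetical stand-in for the survey data):

```python
import pandas as pd

# Hypothetical stand-in for the survey DataFrame.
toy = pd.DataFrame({"Respondent": [1, 2, 3], "Hobbyist": ["Yes", "No", "Yes"]})

n_rows = len(toy)                   # number of rows
n_rows_from_shape = toy.shape[0]    # first entry of the (rows, columns) tuple
n_rows_from_index = len(toy.index)  # length of the index object

print(n_rows, n_rows_from_shape, n_rows_from_index)
```

All three expressions give the same value; `len(df)` is usually the shortest to read.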
{
"metadata": {},
"cell_type": "markdown",
"source": "## Data Cleaning : Approach - I\n\n### Removing missing data\n\nThe most important step for data pre-processing is checking if the dataset has any missing values. \n\nIf we are creating any kind of machine learning model then our model wouldn't perform well with missing values/data. \nOne of the approaches to mitigate this approach is to remove missing data from the dataset.\n\n\nThe way we do it is delete the row if the missing value corresponds to the places in the row or delete the column if it is having 70-75% of missing data. \n\nThis is not really the threshold value and it mostly depends on how much we wish to fix it. The main disadvantage of this appproach is that we end up losing losing important information, because we are deleting a whole feature based on a few missing values."
},
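The column rule described above can be sketched as follows. This is only an illustration under stated assumptions: the toy data, the column names, and the 0.7 cut-off are made up, and the notebook itself goes on to call `dropna()` directly without a column threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "some_missing": [10.0, np.nan, 30.0, 40.0],       # 25% missing
})

threshold = 0.7  # assumed cut-off: drop columns with more than 70% missing values

missing_fraction = toy.isna().mean()  # fraction of NaN per column
cols_to_drop = missing_fraction[missing_fraction > threshold].index
reduced = toy.drop(columns=cols_to_drop)  # remove the sparse columns first
cleaned = reduced.dropna()                # then remove rows that still contain NaN

print(missing_fraction)
print(cleaned)
```

Dropping sparse columns before dropping rows tends to preserve more rows, since a mostly empty column would otherwise disqualify almost every row.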
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 4. Let's see what are the datatypes of all the columns here"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.dtypes",
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 8,
"data": {
"text/plain": "Respondent int64\nMainBranch object\nHobbyist object\nOpenSourcer object\nOpenSource object\nEmployment object\nCountry object\nStudent object\nEdLevel object\nUndergradMajor object\nEduOther object\nOrgSize object\nDevType object\nYearsCode object\nAge1stCode object\nYearsCodePro object\nCareerSat object\nJobSat object\nMgrIdiot object\nMgrMoney object\nMgrWant object\nJobSeek object\nLastHireDate object\nLastInt object\nFizzBuzz object\nJobFactors object\nResumeUpdate object\nCurrencySymbol object\nCurrencyDesc object\nCompTotal float64\n ... \nContainers object\nBlockchainOrg object\nBlockchainIs object\nBetterLife object\nITperson object\nOffOn object\nSocialMedia object\nExtraversion object\nScreenName object\nSOVisit1st object\nSOVisitFreq object\nSOVisitTo object\nSOFindAnswer object\nSOTimeSaved object\nSOHowMuchTime object\nSOAccount object\nSOPartFreq object\nSOJobs object\nEntTeams object\nSOComm object\nWelcomeChange object\nSONewContent object\nAge float64\nGender object\nTrans object\nSexuality object\nEthnicity object\nDependents object\nSurveyLength object\nSurveyEase object\nLength: 85, dtype: object"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 5. Checking the total \"NaN\" values across all the columns "
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.isna().sum()",
"execution_count": 9,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 9,
"data": {
"text/plain": "Respondent 0\nMainBranch 552\nHobbyist 0\nOpenSourcer 0\nOpenSource 2041\nEmployment 1702\nCountry 132\nStudent 1869\nEdLevel 2493\nUndergradMajor 13269\nEduOther 4623\nOrgSize 17092\nDevType 7548\nYearsCode 945\nAge1stCode 1249\nYearsCodePro 14552\nCareerSat 16036\nJobSat 17895\nMgrIdiot 27724\nMgrMoney 27726\nMgrWant 27651\nJobSeek 8328\nLastHireDate 9029\nLastInt 21728\nFizzBuzz 17539\nJobFactors 9512\nResumeUpdate 11006\nCurrencySymbol 17491\nCurrencyDesc 17491\nCompTotal 32938\n ... \nContainers 3517\nBlockchainOrg 40708\nBlockchainIs 28718\nBetterLife 2614\nITperson 1742\nOffOn 2220\nSocialMedia 4446\nExtraversion 1578\nScreenName 8397\nSOVisit1st 5006\nSOVisitFreq 620\nSOVisitTo 797\nSOFindAnswer 1067\nSOTimeSaved 2539\nSOHowMuchTime 20505\nSOAccount 1055\nSOPartFreq 14191\nSOJobs 817\nEntTeams 1042\nSOComm 752\nWelcomeChange 3028\nSONewContent 19323\nAge 9673\nGender 3477\nTrans 5276\nSexuality 12736\nEthnicity 12215\nDependents 5824\nSurveyLength 1899\nSurveyEase 1802\nLength: 85, dtype: int64"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 6. Now that we have seen the columns that has missing values, we can remove them using the dropna() function"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df = df.dropna()",
"execution_count": 10,
"outputs": []
},
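With its default arguments, `dropna()` removes every row that has at least one missing value, which can shrink a wide survey dataset considerably. The sketch below compares the default with the `subset` and `thresh` parameters on a hypothetical toy frame (the values are invented; only the column names echo the survey):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values scattered across columns.
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 31.0, 40.0],
    "Country": ["India", "Ukraine", None, "Thailand"],
    "CompTotal": [50000.0, 62000.0, np.nan, np.nan],
})

all_complete = toy.dropna()             # keep only rows with no missing value at all
age_known = toy.dropna(subset=["Age"])  # only require Age to be present
mostly_filled = toy.dropna(thresh=2)    # keep rows with at least 2 non-missing values

print(len(toy), len(all_complete), len(age_known), len(mostly_filled))
```

Comparing `len(df)` before and after the call is a quick way to see how much data a given choice costs.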
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 7. Check if there are any null values present now"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.isna().sum()",
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 11,
"data": {
"text/plain": "Respondent 0\nMainBranch 0\nHobbyist 0\nOpenSourcer 0\nOpenSource 0\nEmployment 0\nCountry 0\nStudent 0\nEdLevel 0\nUndergradMajor 0\nEduOther 0\nOrgSize 0\nDevType 0\nYearsCode 0\nAge1stCode 0\nYearsCodePro 0\nCareerSat 0\nJobSat 0\nMgrIdiot 0\nMgrMoney 0\nMgrWant 0\nJobSeek 0\nLastHireDate 0\nLastInt 0\nFizzBuzz 0\nJobFactors 0\nResumeUpdate 0\nCurrencySymbol 0\nCurrencyDesc 0\nCompTotal 0\n ..\nContainers 0\nBlockchainOrg 0\nBlockchainIs 0\nBetterLife 0\nITperson 0\nOffOn 0\nSocialMedia 0\nExtraversion 0\nScreenName 0\nSOVisit1st 0\nSOVisitFreq 0\nSOVisitTo 0\nSOFindAnswer 0\nSOTimeSaved 0\nSOHowMuchTime 0\nSOAccount 0\nSOPartFreq 0\nSOJobs 0\nEntTeams 0\nSOComm 0\nWelcomeChange 0\nSONewContent 0\nAge 0\nGender 0\nTrans 0\nSexuality 0\nEthnicity 0\nDependents 0\nSurveyLength 0\nSurveyEase 0\nLength: 85, dtype: int64"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Data Cleaning : Approach - II\n\n### Mean/Median/Mode Imputation for handling missing data\n\nIn this approach we will calculate the mean/median for numerical data and use the result to replace the missing values. For missing values in case of categorical data we compute the mode and replace the missing data with the mode. The benefit of this approach is it prevents data loss, however the disadvantage of this approach is you are not sure how accurate the mean, median or mode is going to be in a given use case."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 1. Load the file into dataFrame df"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "survey = \"/developer_survey_2019/survey_results_public.csv\"",
"execution_count": 12,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df = pd.read_csv(survey)",
"execution_count": 13,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.head()",
"execution_count": 14,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 14,
"data": {
"text/plain": " Respondent MainBranch Hobbyist \\\n0 1 I am a student who is learning to code Yes \n1 2 I am a student who is learning to code No \n2 3 I am not primarily a developer, but I write co... Yes \n3 4 I am a developer by profession No \n4 5 I am a developer by profession Yes \n\n OpenSourcer \\\n0 Never \n1 Less than once per year \n2 Never \n3 Never \n4 Once a month or more often \n\n OpenSource \\\n0 The quality of OSS and closed source software ... \n1 The quality of OSS and closed source software ... \n2 The quality of OSS and closed source software ... \n3 The quality of OSS and closed source software ... \n4 OSS is, on average, of HIGHER quality than pro... \n\n Employment Country \\\n0 Not employed, and not looking for work United Kingdom \n1 Not employed, but looking for work Bosnia and Herzegovina \n2 Employed full-time Thailand \n3 Employed full-time United States \n4 Employed full-time Ukraine \n\n Student EdLevel \\\n0 No Primary/elementary school \n1 Yes, full-time Secondary school (e.g. American high school, G... \n2 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n3 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n4 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n\n UndergradMajor \\\n0 NaN \n1 NaN \n2 Web development or web design \n3 Computer science, computer engineering, or sof... \n4 Computer science, computer engineering, or sof... \n\n ... WelcomeChange \\\n0 ... Just as welcome now as I felt last year \n1 ... Just as welcome now as I felt last year \n2 ... Just as welcome now as I felt last year \n3 ... Just as welcome now as I felt last year \n4 ... Just as welcome now as I felt last year \n\n SONewContent Age Gender Trans \\\n0 Tech articles written by other developers;Indu... 14.0 Man No \n1 Tech articles written by other developers;Indu... 19.0 Man No \n2 Tech meetups or events in your area;Courses on... 28.0 Man No \n3 Tech articles written by other developers;Indu... 22.0 Man No \n4 Tech meetups or events in your area;Courses on... 
30.0 Man No \n\n Sexuality Ethnicity \\\n0 Straight / Heterosexual NaN \n1 Straight / Heterosexual NaN \n2 Straight / Heterosexual NaN \n3 Straight / Heterosexual White or of European descent \n4 Straight / Heterosexual White or of European descent;Multiracial \n\n Dependents SurveyLength SurveyEase \n0 No Appropriate in length Neither easy nor difficult \n1 No Appropriate in length Neither easy nor difficult \n2 Yes Appropriate in length Neither easy nor difficult \n3 No Appropriate in length Easy \n4 No Appropriate in length Easy \n\n[5 rows x 85 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Respondent</th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>...</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Age</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>I am a student who is learning to code</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Not employed, and not looking for work</td>\n <td>United Kingdom</td>\n <td>No</td>\n <td>Primary/elementary school</td>\n <td>NaN</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>14.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>I am a student who is learning to code</td>\n <td>No</td>\n <td>Less than once per year</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Not employed, but looking for work</td>\n <td>Bosnia and Herzegovina</td>\n <td>Yes, full-time</td>\n <td>Secondary school (e.g. 
American high school, G...</td>\n <td>NaN</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>19.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>I am not primarily a developer, but I write co...</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>Thailand</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Web development or web design</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech meetups or events in your area;Courses on...</td>\n <td>28.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>I am a developer by profession</td>\n <td>No</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>22.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Once a month or more often</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>Ukraine</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n 
<td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech meetups or events in your area;Courses on...</td>\n <td>30.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent;Multiracial</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 85 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 2. check the data types for individual columns to understand which is a numerical and which one is a categorical column"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.dtypes",
"execution_count": 15,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 15,
"data": {
"text/plain": "Respondent int64\nMainBranch object\nHobbyist object\nOpenSourcer object\nOpenSource object\nEmployment object\nCountry object\nStudent object\nEdLevel object\nUndergradMajor object\nEduOther object\nOrgSize object\nDevType object\nYearsCode object\nAge1stCode object\nYearsCodePro object\nCareerSat object\nJobSat object\nMgrIdiot object\nMgrMoney object\nMgrWant object\nJobSeek object\nLastHireDate object\nLastInt object\nFizzBuzz object\nJobFactors object\nResumeUpdate object\nCurrencySymbol object\nCurrencyDesc object\nCompTotal float64\n ... \nContainers object\nBlockchainOrg object\nBlockchainIs object\nBetterLife object\nITperson object\nOffOn object\nSocialMedia object\nExtraversion object\nScreenName object\nSOVisit1st object\nSOVisitFreq object\nSOVisitTo object\nSOFindAnswer object\nSOTimeSaved object\nSOHowMuchTime object\nSOAccount object\nSOPartFreq object\nSOJobs object\nEntTeams object\nSOComm object\nWelcomeChange object\nSONewContent object\nAge float64\nGender object\nTrans object\nSexuality object\nEthnicity object\nDependents object\nSurveyLength object\nSurveyEase object\nLength: 85, dtype: object"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 3. Impute the numerical data of CompTotal column with its median. To do so, first find the median of the CompTotal column using the median() function of pandas, and then print it:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "median_CompTotal = df.CompTotal.median()\nprint(median_CompTotal)",
"execution_count": 16,
"outputs": [
{
"output_type": "stream",
"text": "62000.0\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 4.Impute the numerical data of Age column with its mean. To do so, first find the mean of the Age column using the mean() function of pandas, and then print it:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "mean_Age = df.Age.mean()\nprint(mean_Age)",
"execution_count": 17,
"outputs": [
{
"output_type": "stream",
"text": "30.336698649160446\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 5. Check to see the \"NaN\" values for different columns"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.isna().sum()",
"execution_count": 18,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 18,
"data": {
"text/plain": "Respondent 0\nMainBranch 552\nHobbyist 0\nOpenSourcer 0\nOpenSource 2041\nEmployment 1702\nCountry 132\nStudent 1869\nEdLevel 2493\nUndergradMajor 13269\nEduOther 4623\nOrgSize 17092\nDevType 7548\nYearsCode 945\nAge1stCode 1249\nYearsCodePro 14552\nCareerSat 16036\nJobSat 17895\nMgrIdiot 27724\nMgrMoney 27726\nMgrWant 27651\nJobSeek 8328\nLastHireDate 9029\nLastInt 21728\nFizzBuzz 17539\nJobFactors 9512\nResumeUpdate 11006\nCurrencySymbol 17491\nCurrencyDesc 17491\nCompTotal 32938\n ... \nContainers 3517\nBlockchainOrg 40708\nBlockchainIs 28718\nBetterLife 2614\nITperson 1742\nOffOn 2220\nSocialMedia 4446\nExtraversion 1578\nScreenName 8397\nSOVisit1st 5006\nSOVisitFreq 620\nSOVisitTo 797\nSOFindAnswer 1067\nSOTimeSaved 2539\nSOHowMuchTime 20505\nSOAccount 1055\nSOPartFreq 14191\nSOJobs 817\nEntTeams 1042\nSOComm 752\nWelcomeChange 3028\nSONewContent 19323\nAge 9673\nGender 3477\nTrans 5276\nSexuality 12736\nEthnicity 12215\nDependents 5824\nSurveyLength 1899\nSurveyEase 1802\nLength: 85, dtype: int64"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 6. Now that we have computed the median for the numerical columns containing missing values, we can fill those missing values of columns with the computed numerical values"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.Age.fillna(mean_Age, inplace = True)",
"execution_count": 19,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.CompTotal.fillna(median_CompTotal,inplace = True)",
"execution_count": 20,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 7. check if Age and CompTotal columns contain any \"NaN\" values"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.isna().sum().Age",
"execution_count": 21,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 21,
"data": {
"text/plain": "0"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.isna().sum().CompTotal",
"execution_count": 22,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 22,
"data": {
"text/plain": "0"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 8. Compute the mode for the categorical columns containing missing values"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "mode_MainBranch = df.MainBranch.mode()[0]",
"execution_count": 23,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "mode_MainBranch",
"execution_count": 24,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 24,
"data": {
"text/plain": "'I am a developer by profession'"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "mode_OpenSource = df.OpenSource.mode()[0]",
"execution_count": 25,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "mode_OpenSource",
"execution_count": 26,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 26,
"data": {
"text/plain": "'The quality of OSS and closed source software is about the same'"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 9. Now that we have computed the mode for the individual columns containing missing values, we can replace those missing data with the mode values for respective columns"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.MainBranch.fillna(mode_MainBranch, inplace = True)",
"execution_count": 27,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.OpenSource.fillna(mode_OpenSource,inplace=True)",
"execution_count": 28,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 10. Now if you check the \"MainBranch\" and \"OpenSource\" columns it has no missing values anymore"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.isna().sum().MainBranch",
"execution_count": 29,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 29,
"data": {
"text/plain": "0"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.isna().sum().OpenSource",
"execution_count": 30,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 30,
"data": {
"text/plain": "0"
},
"metadata": {}
}
]
},
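The per-column steps above can be generalized into a small helper. The sketch below is hypothetical: `impute_simple` is not part of the notebook, and it assumes the median is acceptable for every numeric column (the notebook uses the mean for Age) and the mode for everything else.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "CompTotal": [1000.0, 3000.0, np.nan],
    "MainBranch": ["dev", None, "dev"],
})

def impute_simple(frame):
    """Return a copy with numeric gaps filled by the median and other gaps by the mode."""
    filled = frame.copy()
    for col in filled.columns:
        if filled[col].isna().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            fill_value = filled[col].median()
        else:
            fill_value = filled[col].mode()[0]  # most frequent value
        filled[col] = filled[col].fillna(fill_value)
    return filled

imputed = impute_simple(toy)
print(imputed)
```

Working on a copy keeps the original frame available, which makes it easier to compare distributions before and after imputation.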
{
"metadata": {},
"cell_type": "markdown",
"source": "## Data Transformation in case of Numerical/Categorical data\n\nIn the above section we did some adjustments to the data by either removing the missing values or by replacing the missing values with mean/median/mode of that particular column. The main goal here is to transform our data into a machine-learning digestable format. As all the machine learning algorithms are mathematical compuation, there is need to transform all the columns into numerical format.\n\nLet's first understand the broader classification of data and try breaking down the broader categories into sub categories.\n\n**1. Numerical** : Numerical data is one that is quantifiable\n**2. Categorical** : These data are non-numeric, generally string which are qualitative.\n\nWe can further breakdown numerical data into following sub-categories:\n\n**1. Discrete**: If you are able to count something then it's discrete. For example, the number of passengers in a bus, the number of people attending a particular meeting, ex., 1,2,3,4.\n\n**2. Continuous**: The numerical form of data which could be measured is continuous. Example, the weight of a person, time taken to travel from one location to another.\n\nSimilarly categorical data are also broken down into below sub categories:\n\n**1. Ordered** : Ordered data are those in which the data is bucketed into certain categories. Example, the Survey for a particular show has can be-- excellent, good, bad, worst.\n\n**2. Nominal** : These are categorical data which doesn't have any order. For example, country.\n\n## Challenges with Categorical Data\n\nThere are challenges while dealing with categorical data and most of the machine learning algorithm dont work well with categorical data. Decision trees will work well with categorical data but of we are dealing with some other machine learning algorithms then we need to convert these categorical data to numerical form. 
If the desired output needs to be categorical then we can convert the numerical data back to categorical format.\n\nLet's see what are the challenges that we might face while dealing with categorical data:\n\n1. **Data with high cardinality**: We might have few columns in out data set which will have a very high cardinality which means that they will have a lot of unique values. For example: the ID column in the data set will have all the unique values in it.\n\n\n2. **Variables with rare occurances**: We might have some data columns as well with very rare occuring variables.\n\n\n3. **Frequent occuring variables**: We might also have some data columns as well which occur many times with low variance.\n\n\n4. We might also encounter some data columns which **won't fit** the model at all if we don't process it.\n\nTo overcome all these above mentioned challenges we use the following methods:\n\n1. **Encoding** : In this method we encode the categorical data to a numerical value. There are 3 types of encoding that we basically follow\n \n i. **Label encoding**\n \n ii.**One hot encoding** \n \n iii. **Dummy encoding**\n\n\n2. **Replacing**: In this method we simply replace the categorical data with a number. This doesnot involve any logical processing."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Dealing with Categorical data - Approach I\n\n### 1. Replace categorical data with number"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 1. Find the categorical data in the current dataframe and then create a new dataframe having only the categorical data. To do so use the select_dtypes() function from pandas"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import numpy as np\ndf_category = df.select_dtypes(exclude=[np.number])",
"execution_count": 31,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_category.head()",
"execution_count": 32,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 32,
"data": {
"text/plain": " MainBranch Hobbyist \\\n0 I am a student who is learning to code Yes \n1 I am a student who is learning to code No \n2 I am not primarily a developer, but I write co... Yes \n3 I am a developer by profession No \n4 I am a developer by profession Yes \n\n OpenSourcer \\\n0 Never \n1 Less than once per year \n2 Never \n3 Never \n4 Once a month or more often \n\n OpenSource \\\n0 The quality of OSS and closed source software ... \n1 The quality of OSS and closed source software ... \n2 The quality of OSS and closed source software ... \n3 The quality of OSS and closed source software ... \n4 OSS is, on average, of HIGHER quality than pro... \n\n Employment Country \\\n0 Not employed, and not looking for work United Kingdom \n1 Not employed, but looking for work Bosnia and Herzegovina \n2 Employed full-time Thailand \n3 Employed full-time United States \n4 Employed full-time Ukraine \n\n Student EdLevel \\\n0 No Primary/elementary school \n1 Yes, full-time Secondary school (e.g. American high school, G... \n2 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n3 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n4 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n\n UndergradMajor \\\n0 NaN \n1 NaN \n2 Web development or web design \n3 Computer science, computer engineering, or sof... \n4 Computer science, computer engineering, or sof... \n\n EduOther \\\n0 Taught yourself a new language, framework, or ... \n1 Taken an online course in programming or softw... \n2 Taught yourself a new language, framework, or ... \n3 Taken an online course in programming or softw... \n4 Taken an online course in programming or softw... \n\n ... SOComm \\\n0 ... Neutral \n1 ... Yes, somewhat \n2 ... Neutral \n3 ... No, not really \n4 ... 
Yes, definitely \n\n WelcomeChange \\\n0 Just as welcome now as I felt last year \n1 Just as welcome now as I felt last year \n2 Just as welcome now as I felt last year \n3 Just as welcome now as I felt last year \n4 Just as welcome now as I felt last year \n\n SONewContent Gender Trans \\\n0 Tech articles written by other developers;Indu... Man No \n1 Tech articles written by other developers;Indu... Man No \n2 Tech meetups or events in your area;Courses on... Man No \n3 Tech articles written by other developers;Indu... Man No \n4 Tech meetups or events in your area;Courses on... Man No \n\n Sexuality Ethnicity \\\n0 Straight / Heterosexual NaN \n1 Straight / Heterosexual NaN \n2 Straight / Heterosexual NaN \n3 Straight / Heterosexual White or of European descent \n4 Straight / Heterosexual White or of European descent;Multiracial \n\n Dependents SurveyLength SurveyEase \n0 No Appropriate in length Neither easy nor difficult \n1 No Appropriate in length Neither easy nor difficult \n2 Yes Appropriate in length Neither easy nor difficult \n3 No Appropriate in length Easy \n4 No Appropriate in length Easy \n\n[5 rows x 79 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>EduOther</th>\n <th>...</th>\n <th>SOComm</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>I am a student who is learning to code</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Not employed, and not looking for work</td>\n <td>United Kingdom</td>\n <td>No</td>\n <td>Primary/elementary school</td>\n <td>NaN</td>\n <td>Taught yourself a new language, framework, or ...</td>\n <td>...</td>\n <td>Neutral</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>1</th>\n <td>I am a student who is learning to code</td>\n <td>No</td>\n <td>Less than once per year</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Not employed, but looking for work</td>\n <td>Bosnia and Herzegovina</td>\n <td>Yes, full-time</td>\n <td>Secondary school (e.g. 
American high school, G...</td>\n <td>NaN</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>Yes, somewhat</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>2</th>\n <td>I am not primarily a developer, but I write co...</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>Thailand</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Web development or web design</td>\n <td>Taught yourself a new language, framework, or ...</td>\n <td>...</td>\n <td>Neutral</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech meetups or events in your area;Courses on...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>3</th>\n <td>I am a developer by profession</td>\n <td>No</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>No, not really</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>4</th>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Once a month or more often</td>\n <td>OSS is, on 
average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>Ukraine</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>Yes, definitely</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech meetups or events in your area;Courses on...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent;Multiracial</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 79 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### As you could see out of 85 columns there are 79 columns that are categorical columns"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 2. Lets take one categorical column \"OpenSourcer\" and see the unique categorical values in it, so that we could replace it with numerical value"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_category['OpenSourcer'].unique()",
"execution_count": 33,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 33,
"data": {
"text/plain": "array(['Never', 'Less than once per year', 'Once a month or more often',\n 'Less than once a month but more than once per year'], dtype=object)"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### There are 4 categorical values in the \"OpenSourcer\" column"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 3. Find the frequency distribution of each categorical column. To do so, use the value_counts() function on each column. This function returns the counts of unique values in an object"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_category.OpenSourcer.value_counts()",
"execution_count": 34,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 34,
"data": {
"text/plain": "Never 32295\nLess than once per year 24972\nLess than once a month but more than once per year 20561\nOnce a month or more often 11055\nName: OpenSourcer, dtype: int64"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 4. replace the entries in the \"OpenSourcer\" column with numerical values as below"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_category.OpenSourcer.replace({'Never':1,'Less than once per year':2,'Less than once a month but more than once per year':3,'Once a month or more often':4},inplace = True)",
"execution_count": 35,
"outputs": [
{
"output_type": "stream",
"text": "C:\\Users\\tkhan050\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\pandas\\core\\generic.py:5890: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame\n\nSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n self._update_inplace(new_data)\n",
"name": "stderr"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_category.head()",
"execution_count": 36,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 36,
"data": {
"text/plain": " MainBranch Hobbyist OpenSourcer \\\n0 I am a student who is learning to code Yes 1 \n1 I am a student who is learning to code No 2 \n2 I am not primarily a developer, but I write co... Yes 1 \n3 I am a developer by profession No 1 \n4 I am a developer by profession Yes 4 \n\n OpenSource \\\n0 The quality of OSS and closed source software ... \n1 The quality of OSS and closed source software ... \n2 The quality of OSS and closed source software ... \n3 The quality of OSS and closed source software ... \n4 OSS is, on average, of HIGHER quality than pro... \n\n Employment Country \\\n0 Not employed, and not looking for work United Kingdom \n1 Not employed, but looking for work Bosnia and Herzegovina \n2 Employed full-time Thailand \n3 Employed full-time United States \n4 Employed full-time Ukraine \n\n Student EdLevel \\\n0 No Primary/elementary school \n1 Yes, full-time Secondary school (e.g. American high school, G... \n2 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n3 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n4 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n\n UndergradMajor \\\n0 NaN \n1 NaN \n2 Web development or web design \n3 Computer science, computer engineering, or sof... \n4 Computer science, computer engineering, or sof... \n\n EduOther \\\n0 Taught yourself a new language, framework, or ... \n1 Taken an online course in programming or softw... \n2 Taught yourself a new language, framework, or ... \n3 Taken an online course in programming or softw... \n4 Taken an online course in programming or softw... \n\n ... SOComm \\\n0 ... Neutral \n1 ... Yes, somewhat \n2 ... Neutral \n3 ... No, not really \n4 ... 
Yes, definitely \n\n WelcomeChange \\\n0 Just as welcome now as I felt last year \n1 Just as welcome now as I felt last year \n2 Just as welcome now as I felt last year \n3 Just as welcome now as I felt last year \n4 Just as welcome now as I felt last year \n\n SONewContent Gender Trans \\\n0 Tech articles written by other developers;Indu... Man No \n1 Tech articles written by other developers;Indu... Man No \n2 Tech meetups or events in your area;Courses on... Man No \n3 Tech articles written by other developers;Indu... Man No \n4 Tech meetups or events in your area;Courses on... Man No \n\n Sexuality Ethnicity \\\n0 Straight / Heterosexual NaN \n1 Straight / Heterosexual NaN \n2 Straight / Heterosexual NaN \n3 Straight / Heterosexual White or of European descent \n4 Straight / Heterosexual White or of European descent;Multiracial \n\n Dependents SurveyLength SurveyEase \n0 No Appropriate in length Neither easy nor difficult \n1 No Appropriate in length Neither easy nor difficult \n2 Yes Appropriate in length Neither easy nor difficult \n3 No Appropriate in length Easy \n4 No Appropriate in length Easy \n\n[5 rows x 79 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>EduOther</th>\n <th>...</th>\n <th>SOComm</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>I am a student who is learning to code</td>\n <td>Yes</td>\n <td>1</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Not employed, and not looking for work</td>\n <td>United Kingdom</td>\n <td>No</td>\n <td>Primary/elementary school</td>\n <td>NaN</td>\n <td>Taught yourself a new language, framework, or ...</td>\n <td>...</td>\n <td>Neutral</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>1</th>\n <td>I am a student who is learning to code</td>\n <td>No</td>\n <td>2</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Not employed, but looking for work</td>\n <td>Bosnia and Herzegovina</td>\n <td>Yes, full-time</td>\n <td>Secondary school (e.g. 
American high school, G...</td>\n <td>NaN</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>Yes, somewhat</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>2</th>\n <td>I am not primarily a developer, but I write co...</td>\n <td>Yes</td>\n <td>1</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>Thailand</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Web development or web design</td>\n <td>Taught yourself a new language, framework, or ...</td>\n <td>...</td>\n <td>Neutral</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech meetups or events in your area;Courses on...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>3</th>\n <td>I am a developer by profession</td>\n <td>No</td>\n <td>1</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>No, not really</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>4</th>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>4</td>\n <td>OSS is, on average, of HIGHER quality than 
pro...</td>\n <td>Employed full-time</td>\n <td>Ukraine</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>Yes, definitely</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech meetups or events in your area;Courses on...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent;Multiracial</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 79 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Dealing with Categorical data - Approach II\n\n### 2. Label Encoding\n\nLabel encoding is a technique in which we basically replace each value in the categorical column with numbers from 0 to n-1. \n\nLet's say we have a list of names in a column. After label encoding the data in that column, each name will be assigned a numerical label. This approach will not be very efficient in every case because the model might make a mistake of considering the numerical values as the weight assigned to the data. \n\nThis approach is best suitable for ordinal data where the categorical data is labeled based on order For example, the attitude towards something (i.e. strongly agree, agree, disagree, strongly disagree) or clothing sizes (i.e. small, medium, large, extra large).The scikit-learn library provides **labelEncoder()** which helps in label encoding"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dataset = \"/developer_survey_2019/survey_results_public.csv\"",
"execution_count": 37,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "label_encoding = pd.read_csv(dataset, header = 0)",
"execution_count": 38,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "label_encoding.head()",
"execution_count": 39,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 39,
"data": {
"text/plain": " Respondent MainBranch Hobbyist \\\n0 1 I am a student who is learning to code Yes \n1 2 I am a student who is learning to code No \n2 3 I am not primarily a developer, but I write co... Yes \n3 4 I am a developer by profession No \n4 5 I am a developer by profession Yes \n\n OpenSourcer \\\n0 Never \n1 Less than once per year \n2 Never \n3 Never \n4 Once a month or more often \n\n OpenSource \\\n0 The quality of OSS and closed source software ... \n1 The quality of OSS and closed source software ... \n2 The quality of OSS and closed source software ... \n3 The quality of OSS and closed source software ... \n4 OSS is, on average, of HIGHER quality than pro... \n\n Employment Country \\\n0 Not employed, and not looking for work United Kingdom \n1 Not employed, but looking for work Bosnia and Herzegovina \n2 Employed full-time Thailand \n3 Employed full-time United States \n4 Employed full-time Ukraine \n\n Student EdLevel \\\n0 No Primary/elementary school \n1 Yes, full-time Secondary school (e.g. American high school, G... \n2 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n3 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n4 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n\n UndergradMajor \\\n0 NaN \n1 NaN \n2 Web development or web design \n3 Computer science, computer engineering, or sof... \n4 Computer science, computer engineering, or sof... \n\n ... WelcomeChange \\\n0 ... Just as welcome now as I felt last year \n1 ... Just as welcome now as I felt last year \n2 ... Just as welcome now as I felt last year \n3 ... Just as welcome now as I felt last year \n4 ... Just as welcome now as I felt last year \n\n SONewContent Age Gender Trans \\\n0 Tech articles written by other developers;Indu... 14.0 Man No \n1 Tech articles written by other developers;Indu... 19.0 Man No \n2 Tech meetups or events in your area;Courses on... 28.0 Man No \n3 Tech articles written by other developers;Indu... 22.0 Man No \n4 Tech meetups or events in your area;Courses on... 
30.0 Man No \n\n Sexuality Ethnicity \\\n0 Straight / Heterosexual NaN \n1 Straight / Heterosexual NaN \n2 Straight / Heterosexual NaN \n3 Straight / Heterosexual White or of European descent \n4 Straight / Heterosexual White or of European descent;Multiracial \n\n Dependents SurveyLength SurveyEase \n0 No Appropriate in length Neither easy nor difficult \n1 No Appropriate in length Neither easy nor difficult \n2 Yes Appropriate in length Neither easy nor difficult \n3 No Appropriate in length Easy \n4 No Appropriate in length Easy \n\n[5 rows x 85 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Respondent</th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>...</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Age</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>I am a student who is learning to code</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Not employed, and not looking for work</td>\n <td>United Kingdom</td>\n <td>No</td>\n <td>Primary/elementary school</td>\n <td>NaN</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>14.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>I am a student who is learning to code</td>\n <td>No</td>\n <td>Less than once per year</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Not employed, but looking for work</td>\n <td>Bosnia and Herzegovina</td>\n <td>Yes, full-time</td>\n <td>Secondary school (e.g. 
American high school, G...</td>\n <td>NaN</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>19.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>I am not primarily a developer, but I write co...</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>Thailand</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Web development or web design</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech meetups or events in your area;Courses on...</td>\n <td>28.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>I am a developer by profession</td>\n <td>No</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>22.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Once a month or more often</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>Ukraine</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n 
<td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech meetups or events in your area;Courses on...</td>\n <td>30.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent;Multiracial</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 85 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "label_encoding.isna().sum()",
"execution_count": 40,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 40,
"data": {
"text/plain": "Respondent 0\nMainBranch 552\nHobbyist 0\nOpenSourcer 0\nOpenSource 2041\nEmployment 1702\nCountry 132\nStudent 1869\nEdLevel 2493\nUndergradMajor 13269\nEduOther 4623\nOrgSize 17092\nDevType 7548\nYearsCode 945\nAge1stCode 1249\nYearsCodePro 14552\nCareerSat 16036\nJobSat 17895\nMgrIdiot 27724\nMgrMoney 27726\nMgrWant 27651\nJobSeek 8328\nLastHireDate 9029\nLastInt 21728\nFizzBuzz 17539\nJobFactors 9512\nResumeUpdate 11006\nCurrencySymbol 17491\nCurrencyDesc 17491\nCompTotal 32938\n ... \nContainers 3517\nBlockchainOrg 40708\nBlockchainIs 28718\nBetterLife 2614\nITperson 1742\nOffOn 2220\nSocialMedia 4446\nExtraversion 1578\nScreenName 8397\nSOVisit1st 5006\nSOVisitFreq 620\nSOVisitTo 797\nSOFindAnswer 1067\nSOTimeSaved 2539\nSOHowMuchTime 20505\nSOAccount 1055\nSOPartFreq 14191\nSOJobs 817\nEntTeams 1042\nSOComm 752\nWelcomeChange 3028\nSONewContent 19323\nAge 9673\nGender 3477\nTrans 5276\nSexuality 12736\nEthnicity 12215\nDependents 5824\nSurveyLength 1899\nSurveyEase 1802\nLength: 85, dtype: int64"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 1. Before doing the label encoding, remove all the missing data. To do so use dropna()function"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "label_encoding = label_encoding.dropna()",
"execution_count": 41,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "label_encoding.head()",
"execution_count": 42,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 42,
"data": {
"text/plain": " Respondent MainBranch Hobbyist \\\n19 20 I am not primarily a developer, but I write co... No \n38 39 I am a developer by profession Yes \n43 44 I am a developer by profession Yes \n82 83 I am a developer by profession Yes \n103 104 I am a developer by profession Yes \n\n OpenSourcer \\\n19 Never \n38 Less than once per year \n43 Once a month or more often \n82 Less than once a month but more than once per ... \n103 Never \n\n OpenSource Employment \\\n19 OSS is, on average, of HIGHER quality than pro... Employed full-time \n38 The quality of OSS and closed source software ... Employed full-time \n43 The quality of OSS and closed source software ... Employed full-time \n82 OSS is, on average, of HIGHER quality than pro... Employed full-time \n103 OSS is, on average, of LOWER quality than prop... Employed full-time \n\n Country Student \\\n19 Lithuania No \n38 United States No \n43 Germany No \n82 India No \n103 India Yes, full-time \n\n EdLevel \\\n19 Master’s degree (MA, MS, M.Eng., MBA, etc.) \n38 Bachelor’s degree (BA, BS, B.Eng., etc.) \n43 Bachelor’s degree (BA, BS, B.Eng., etc.) \n82 Bachelor’s degree (BA, BS, B.Eng., etc.) \n103 Bachelor’s degree (BA, BS, B.Eng., etc.) \n\n UndergradMajor \\\n19 Information systems, information technology, o... \n38 Computer science, computer engineering, or sof... \n43 Information systems, information technology, o... \n82 Web development or web design \n103 Computer science, computer engineering, or sof... \n\n ... \\\n19 ... \n38 ... \n43 ... \n82 ... \n103 ... \n\n WelcomeChange \\\n19 Not applicable - I did not use Stack Overflow ... 
\n38 Somewhat less welcome now than last year \n43 Just as welcome now as I felt last year \n82 Just as welcome now as I felt last year \n103 Just as welcome now as I felt last year \n\n SONewContent Age Gender Trans \\\n19 Tech articles written by other developers 38.0 Man No \n38 Tech articles written by other developers 42.0 Man No \n43 Tech articles written by other developers;Indu... 43.0 Man No \n82 Industry news about technologies you're intere... 22.0 Man No \n103 Tech articles written by other developers;Indu... 29.0 Man No \n\n Sexuality Ethnicity Dependents \\\n19 Straight / Heterosexual White or of European descent Yes \n38 Bisexual White or of European descent No \n43 Straight / Heterosexual White or of European descent Yes \n82 Straight / Heterosexual South Asian No \n103 Straight / Heterosexual South Asian Yes \n\n SurveyLength SurveyEase \n19 Appropriate in length Easy \n38 Appropriate in length Easy \n43 Appropriate in length Easy \n82 Appropriate in length Neither easy nor difficult \n103 Appropriate in length Easy \n\n[5 rows x 85 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Respondent</th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>...</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Age</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>19</th>\n <td>20</td>\n <td>I am not primarily a developer, but I write co...</td>\n <td>No</td>\n <td>Never</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>Lithuania</td>\n <td>No</td>\n <td>Master’s degree (MA, MS, M.Eng., MBA, etc.)</td>\n <td>Information systems, information technology, o...</td>\n <td>...</td>\n <td>Not applicable - I did not use Stack Overflow ...</td>\n <td>Tech articles written by other developers</td>\n <td>38.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>38</th>\n <td>39</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once per year</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Somewhat less welcome now than last year</td>\n <td>Tech articles written by other developers</td>\n <td>42.0</td>\n 
<td>Man</td>\n <td>No</td>\n <td>Bisexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>43</th>\n <td>44</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Once a month or more often</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>Germany</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Information systems, information technology, o...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>43.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>82</th>\n <td>83</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once a month but more than once per ...</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>India</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Web development or web design</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Industry news about technologies you're intere...</td>\n <td>22.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>South Asian</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>103</th>\n <td>104</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Never</td>\n <td>OSS is, on average, of LOWER quality than prop...</td>\n <td>Employed full-time</td>\n <td>India</td>\n <td>Yes, full-time</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other 
developers;Indu...</td>\n <td>29.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>South Asian</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 85 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "label_encoding.isna().sum()",
"execution_count": 43,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 43,
"data": {
"text/plain": "Respondent 0\nMainBranch 0\nHobbyist 0\nOpenSourcer 0\nOpenSource 0\nEmployment 0\nCountry 0\nStudent 0\nEdLevel 0\nUndergradMajor 0\nEduOther 0\nOrgSize 0\nDevType 0\nYearsCode 0\nAge1stCode 0\nYearsCodePro 0\nCareerSat 0\nJobSat 0\nMgrIdiot 0\nMgrMoney 0\nMgrWant 0\nJobSeek 0\nLastHireDate 0\nLastInt 0\nFizzBuzz 0\nJobFactors 0\nResumeUpdate 0\nCurrencySymbol 0\nCurrencyDesc 0\nCompTotal 0\n ..\nContainers 0\nBlockchainOrg 0\nBlockchainIs 0\nBetterLife 0\nITperson 0\nOffOn 0\nSocialMedia 0\nExtraversion 0\nScreenName 0\nSOVisit1st 0\nSOVisitFreq 0\nSOVisitTo 0\nSOFindAnswer 0\nSOTimeSaved 0\nSOHowMuchTime 0\nSOAccount 0\nSOPartFreq 0\nSOJobs 0\nEntTeams 0\nSOComm 0\nWelcomeChange 0\nSONewContent 0\nAge 0\nGender 0\nTrans 0\nSexuality 0\nEthnicity 0\nDependents 0\nSurveyLength 0\nSurveyEase 0\nLength: 85, dtype: int64"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 2. select the categorical data from the \"label_encoding\" dataframe and create a new dataframe with only the categorical data"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "data_category = label_encoding.select_dtypes(exclude=[np.number]).columns",
"execution_count": 44,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "data_category",
"execution_count": 45,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 45,
"data": {
"text/plain": "Index(['MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource', 'Employment',\n 'Country', 'Student', 'EdLevel', 'UndergradMajor', 'EduOther',\n 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode', 'YearsCodePro',\n 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney', 'MgrWant', 'JobSeek',\n 'LastHireDate', 'LastInt', 'FizzBuzz', 'JobFactors', 'ResumeUpdate',\n 'CurrencySymbol', 'CurrencyDesc', 'CompFreq', 'WorkPlan',\n 'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',\n 'UnitTests', 'PurchaseHow', 'PurchaseWhat', 'LanguageWorkedWith',\n 'LanguageDesireNextYear', 'DatabaseWorkedWith',\n 'DatabaseDesireNextYear', 'PlatformWorkedWith',\n 'PlatformDesireNextYear', 'WebFrameWorkedWith',\n 'WebFrameDesireNextYear', 'MiscTechWorkedWith',\n 'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',\n 'BlockchainOrg', 'BlockchainIs', 'BetterLife', 'ITperson', 'OffOn',\n 'SocialMedia', 'Extraversion', 'ScreenName', 'SOVisit1st',\n 'SOVisitFreq', 'SOVisitTo', 'SOFindAnswer', 'SOTimeSaved',\n 'SOHowMuchTime', 'SOAccount', 'SOPartFreq', 'SOJobs', 'EntTeams',\n 'SOComm', 'WelcomeChange', 'SONewContent', 'Gender', 'Trans',\n 'Sexuality', 'Ethnicity', 'Dependents', 'SurveyLength', 'SurveyEase'],\n dtype='object')"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "label_encoding[data_category].head()",
"execution_count": 46,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 46,
"data": {
"text/plain": " MainBranch Hobbyist \\\n19 I am not primarily a developer, but I write co... No \n38 I am a developer by profession Yes \n43 I am a developer by profession Yes \n82 I am a developer by profession Yes \n103 I am a developer by profession Yes \n\n OpenSourcer \\\n19 Never \n38 Less than once per year \n43 Once a month or more often \n82 Less than once a month but more than once per ... \n103 Never \n\n OpenSource Employment \\\n19 OSS is, on average, of HIGHER quality than pro... Employed full-time \n38 The quality of OSS and closed source software ... Employed full-time \n43 The quality of OSS and closed source software ... Employed full-time \n82 OSS is, on average, of HIGHER quality than pro... Employed full-time \n103 OSS is, on average, of LOWER quality than prop... Employed full-time \n\n Country Student \\\n19 Lithuania No \n38 United States No \n43 Germany No \n82 India No \n103 India Yes, full-time \n\n EdLevel \\\n19 Master’s degree (MA, MS, M.Eng., MBA, etc.) \n38 Bachelor’s degree (BA, BS, B.Eng., etc.) \n43 Bachelor’s degree (BA, BS, B.Eng., etc.) \n82 Bachelor’s degree (BA, BS, B.Eng., etc.) \n103 Bachelor’s degree (BA, BS, B.Eng., etc.) \n\n UndergradMajor \\\n19 Information systems, information technology, o... \n38 Computer science, computer engineering, or sof... \n43 Information systems, information technology, o... \n82 Web development or web design \n103 Computer science, computer engineering, or sof... \n\n EduOther \\\n19 Taken an online course in programming or softw... \n38 Taken an online course in programming or softw... \n43 Taken an online course in programming or softw... \n82 Taken an online course in programming or softw... \n103 Taken a part-time in-person course in programm... \n\n ... SOComm \\\n19 ... Neutral \n38 ... Yes, definitely \n43 ... Yes, somewhat \n82 ... Neutral \n103 ... Yes, definitely \n\n WelcomeChange \\\n19 Not applicable - I did not use Stack Overflow ... 
\n38 Somewhat less welcome now than last year \n43 Just as welcome now as I felt last year \n82 Just as welcome now as I felt last year \n103 Just as welcome now as I felt last year \n\n SONewContent Gender Trans \\\n19 Tech articles written by other developers Man No \n38 Tech articles written by other developers Man No \n43 Tech articles written by other developers;Indu... Man No \n82 Industry news about technologies you're intere... Man No \n103 Tech articles written by other developers;Indu... Man No \n\n Sexuality Ethnicity Dependents \\\n19 Straight / Heterosexual White or of European descent Yes \n38 Bisexual White or of European descent No \n43 Straight / Heterosexual White or of European descent Yes \n82 Straight / Heterosexual South Asian No \n103 Straight / Heterosexual South Asian Yes \n\n SurveyLength SurveyEase \n19 Appropriate in length Easy \n38 Appropriate in length Easy \n43 Appropriate in length Easy \n82 Appropriate in length Neither easy nor difficult \n103 Appropriate in length Easy \n\n[5 rows x 79 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>EduOther</th>\n <th>...</th>\n <th>SOComm</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>19</th>\n <td>I am not primarily a developer, but I write co...</td>\n <td>No</td>\n <td>Never</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>Lithuania</td>\n <td>No</td>\n <td>Master’s degree (MA, MS, M.Eng., MBA, etc.)</td>\n <td>Information systems, information technology, o...</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>Neutral</td>\n <td>Not applicable - I did not use Stack Overflow ...</td>\n <td>Tech articles written by other developers</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>38</th>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once per year</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>Yes, definitely</td>\n 
<td>Somewhat less welcome now than last year</td>\n <td>Tech articles written by other developers</td>\n <td>Man</td>\n <td>No</td>\n <td>Bisexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>43</th>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Once a month or more often</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>Germany</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Information systems, information technology, o...</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>Yes, somewhat</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>82</th>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once a month but more than once per ...</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>India</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Web development or web design</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>Neutral</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Industry news about technologies you're intere...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>South Asian</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>103</th>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Never</td>\n <td>OSS is, on average, of LOWER quality than prop...</td>\n <td>Employed full-time</td>\n <td>India</td>\n <td>Yes, full-time</td>\n <td>Bachelor’s degree 
(BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>Taken a part-time in-person course in programm...</td>\n <td>...</td>\n <td>Yes, definitely</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>South Asian</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 79 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 3. Iterate through this category column and convert it to numeric data using LabelEncoder(). To do so, import the sklearn.preprocessing package and use the LabelEncoder() class to transform the data:\n\nWe use fit_transform() here to apply the label encoder. Let's understand the advantage of using fit_transform().\n\n\nTo center the data (make it have zero mean and unit standard error), you subtract the mean and then divide the result by the standard deviation.\n\n x′=x−μ/σ\n \n\nYou do that on the training set of data. But then you have to apply the same transformation to your testing set (e.g. in cross-validation), or to newly obtained examples before forecast. But you have to use the same two parameters μ and σ (values) that you used for centering the training set.\n\nHence, every sklearn's transform's **fit()** just calculates the parameters (e.g. μ and σ in case of StandardScaler) and saves them as an internal objects state. Afterwards, you can call its **transform()** method to apply the transformation to a particular set of examples.\n\n**fit_transform()** joins these two steps and is used for the initial fitting of parameters on the training set x, but it also returns a transformed **x′**. Internally, it just calls first **fit()** and then **transform()** on the same data."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.preprocessing import LabelEncoder\n\n# Creating the object instance for label encoder\n\nlabel_encoder = LabelEncoder()\nfor i in data_category:\n label_encoding[i] = label_encoder.fit_transform(label_encoding[i])\nlabel_encoding.head()",
"execution_count": 47,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 47,
"data": {
"text/plain": " Respondent MainBranch Hobbyist OpenSourcer OpenSource Employment \\\n19 20 1 0 2 0 0 \n38 39 0 1 1 2 0 \n43 44 0 1 3 2 0 \n82 83 0 1 0 0 0 \n103 104 0 1 2 1 0 \n\n Country Student EdLevel UndergradMajor ... WelcomeChange \\\n19 56 0 2 9 ... 3 \n38 105 0 1 6 ... 4 \n43 34 0 1 9 ... 2 \n82 42 0 1 11 ... 2 \n103 42 1 1 6 ... 2 \n\n SONewContent Age Gender Trans Sexuality Ethnicity Dependents \\\n19 5 38.0 0 0 5 58 1 \n38 5 42.0 0 0 0 58 0 \n43 9 43.0 0 0 5 58 1 \n82 4 22.0 0 0 5 54 0 \n103 10 29.0 0 0 5 54 1 \n\n SurveyLength SurveyEase \n19 0 1 \n38 0 1 \n43 0 1 \n82 0 2 \n103 0 1 \n\n[5 rows x 85 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Respondent</th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>...</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Age</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>19</th>\n <td>20</td>\n <td>1</td>\n <td>0</td>\n <td>2</td>\n <td>0</td>\n <td>0</td>\n <td>56</td>\n <td>0</td>\n <td>2</td>\n <td>9</td>\n <td>...</td>\n <td>3</td>\n <td>5</td>\n <td>38.0</td>\n <td>0</td>\n <td>0</td>\n <td>5</td>\n <td>58</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>38</th>\n <td>39</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>2</td>\n <td>0</td>\n <td>105</td>\n <td>0</td>\n <td>1</td>\n <td>6</td>\n <td>...</td>\n <td>4</td>\n <td>5</td>\n <td>42.0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>58</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>43</th>\n <td>44</td>\n <td>0</td>\n <td>1</td>\n <td>3</td>\n <td>2</td>\n <td>0</td>\n <td>34</td>\n <td>0</td>\n <td>1</td>\n <td>9</td>\n <td>...</td>\n <td>2</td>\n <td>9</td>\n <td>43.0</td>\n <td>0</td>\n <td>0</td>\n <td>5</td>\n <td>58</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>82</th>\n <td>83</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>42</td>\n <td>0</td>\n <td>1</td>\n <td>11</td>\n <td>...</td>\n <td>2</td>\n <td>4</td>\n <td>22.0</td>\n <td>0</td>\n <td>0</td>\n <td>5</td>\n <td>54</td>\n <td>0</td>\n 
<td>0</td>\n <td>2</td>\n </tr>\n <tr>\n <th>103</th>\n <td>104</td>\n <td>0</td>\n <td>1</td>\n <td>2</td>\n <td>1</td>\n <td>0</td>\n <td>42</td>\n <td>1</td>\n <td>1</td>\n <td>6</td>\n <td>...</td>\n <td>2</td>\n <td>10</td>\n <td>29.0</td>\n <td>0</td>\n <td>0</td>\n <td>5</td>\n <td>54</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 85 columns</p>\n</div>"
},
"metadata": {}
}
]
},
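{
"metadata": {},
"cell_type": "markdown",
"source": "The **fit()** / **transform()** split described in step 3 can be sketched on toy data. This is a minimal illustration, assuming StandardScaler and two small hypothetical arrays (x_train, x_test) that are not part of the survey dataset: the parameters μ and σ are learned from the training data only and then reused on the test data."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import numpy as np\nfrom sklearn.preprocessing import StandardScaler\n\n# Hypothetical toy data, not taken from the survey\nx_train = np.array([[1.0], [2.0], [3.0]])\nx_test = np.array([[2.0], [4.0]])\n\nscaler = StandardScaler()\n\n# fit_transform() learns the mean and standard deviation from x_train\n# and returns the scaled training data\nx_train_scaled = scaler.fit_transform(x_train)\n\n# transform() reuses the stored parameters; nothing is refitted on x_test\nx_test_scaled = scaler.transform(x_test)\n\nprint(scaler.mean_, scaler.scale_)\nprint(x_test_scaled.ravel())",
"execution_count": null,
"outputs": []
},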
{
"metadata": {},
"cell_type": "markdown",
"source": "## Dealing with Categorical data - Approach III\n\n### 3. One-hot Encoding\n\nIn the previous approach we used label encoding for categorical data to convert it to numerical values. The values were assigned labels in the from 1,2,3. However in case of predictive modeling, the machine learning algorithm might make a mistake of considering these labels as some kind of order/weight. To avoid this confusion we use one-hot encoding.\n\nHow one-hot encoding works is the label-encoded data is further broken down into n columns, where n, denotes the total number of unique labels generated while performing lebel encoding.\n\nFor example, say a column has 3 unique labels, after performing one-hot encoding the column will further be divided into column_1, column_2, column_3 different columns"
},
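{
"metadata": {},
"cell_type": "markdown",
"source": "As a small illustration of the idea above, the sketch below one-hot encodes a toy column with three unique labels using pandas get_dummies(). The toy dataframe and its values are hypothetical and are not taken from the survey, and get_dummies() is only one possible way to do this; sklearn's OneHotEncoder is another."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import pandas as pd\n\n# Hypothetical toy column with three unique labels\ntoy = pd.DataFrame({'OpSys': ['Windows', 'Linux', 'MacOS', 'Linux']})\n\n# get_dummies() creates one indicator column per unique label\none_hot = pd.get_dummies(toy, columns=['OpSys'])\n\nprint(one_hot.columns.tolist())\nprint(one_hot.shape)",
"execution_count": null,
"outputs": []
},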
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import pandas as pd\nimport numpy as np\nfrom sklearn.preprocessing import OneHotEncoder",
"execution_count": 70,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dataset_dummy_encoding = \"/developer_survey_2019/survey_results_public.csv\"",
"execution_count": 71,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dummy_encoding = pd.read_csv(dataset_dummy_encoding,header = 0)",
"execution_count": 72,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dummy_encoding.head()",
"execution_count": 73,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 73,
"data": {
"text/plain": " Respondent MainBranch Hobbyist \\\n0 1 I am a student who is learning to code Yes \n1 2 I am a student who is learning to code No \n2 3 I am not primarily a developer, but I write co... Yes \n3 4 I am a developer by profession No \n4 5 I am a developer by profession Yes \n\n OpenSourcer \\\n0 Never \n1 Less than once per year \n2 Never \n3 Never \n4 Once a month or more often \n\n OpenSource \\\n0 The quality of OSS and closed source software ... \n1 The quality of OSS and closed source software ... \n2 The quality of OSS and closed source software ... \n3 The quality of OSS and closed source software ... \n4 OSS is, on average, of HIGHER quality than pro... \n\n Employment Country \\\n0 Not employed, and not looking for work United Kingdom \n1 Not employed, but looking for work Bosnia and Herzegovina \n2 Employed full-time Thailand \n3 Employed full-time United States \n4 Employed full-time Ukraine \n\n Student EdLevel \\\n0 No Primary/elementary school \n1 Yes, full-time Secondary school (e.g. American high school, G... \n2 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n3 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n4 No Bachelor’s degree (BA, BS, B.Eng., etc.) \n\n UndergradMajor \\\n0 NaN \n1 NaN \n2 Web development or web design \n3 Computer science, computer engineering, or sof... \n4 Computer science, computer engineering, or sof... \n\n ... WelcomeChange \\\n0 ... Just as welcome now as I felt last year \n1 ... Just as welcome now as I felt last year \n2 ... Just as welcome now as I felt last year \n3 ... Just as welcome now as I felt last year \n4 ... Just as welcome now as I felt last year \n\n SONewContent Age Gender Trans \\\n0 Tech articles written by other developers;Indu... 14.0 Man No \n1 Tech articles written by other developers;Indu... 19.0 Man No \n2 Tech meetups or events in your area;Courses on... 28.0 Man No \n3 Tech articles written by other developers;Indu... 22.0 Man No \n4 Tech meetups or events in your area;Courses on... 
30.0 Man No \n\n Sexuality Ethnicity \\\n0 Straight / Heterosexual NaN \n1 Straight / Heterosexual NaN \n2 Straight / Heterosexual NaN \n3 Straight / Heterosexual White or of European descent \n4 Straight / Heterosexual White or of European descent;Multiracial \n\n Dependents SurveyLength SurveyEase \n0 No Appropriate in length Neither easy nor difficult \n1 No Appropriate in length Neither easy nor difficult \n2 Yes Appropriate in length Neither easy nor difficult \n3 No Appropriate in length Easy \n4 No Appropriate in length Easy \n\n[5 rows x 85 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Respondent</th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>...</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Age</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>I am a student who is learning to code</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Not employed, and not looking for work</td>\n <td>United Kingdom</td>\n <td>No</td>\n <td>Primary/elementary school</td>\n <td>NaN</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>14.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>I am a student who is learning to code</td>\n <td>No</td>\n <td>Less than once per year</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Not employed, but looking for work</td>\n <td>Bosnia and Herzegovina</td>\n <td>Yes, full-time</td>\n <td>Secondary school (e.g. 
American high school, G...</td>\n <td>NaN</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>19.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>I am not primarily a developer, but I write co...</td>\n <td>Yes</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>Thailand</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Web development or web design</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech meetups or events in your area;Courses on...</td>\n <td>28.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>NaN</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>I am a developer by profession</td>\n <td>No</td>\n <td>Never</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>22.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Once a month or more often</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>Ukraine</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n 
<td>Computer science, computer engineering, or sof...</td>\n <td>...</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech meetups or events in your area;Courses on...</td>\n <td>30.0</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent;Multiracial</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 85 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 1. check for the \"NaN\" in the columns using the below code:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dummy_encoding.isna().sum()",
"execution_count": 74,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 74,
"data": {
"text/plain": "Respondent 0\nMainBranch 552\nHobbyist 0\nOpenSourcer 0\nOpenSource 2041\nEmployment 1702\nCountry 132\nStudent 1869\nEdLevel 2493\nUndergradMajor 13269\nEduOther 4623\nOrgSize 17092\nDevType 7548\nYearsCode 945\nAge1stCode 1249\nYearsCodePro 14552\nCareerSat 16036\nJobSat 17895\nMgrIdiot 27724\nMgrMoney 27726\nMgrWant 27651\nJobSeek 8328\nLastHireDate 9029\nLastInt 21728\nFizzBuzz 17539\nJobFactors 9512\nResumeUpdate 11006\nCurrencySymbol 17491\nCurrencyDesc 17491\nCompTotal 32938\n ... \nContainers 3517\nBlockchainOrg 40708\nBlockchainIs 28718\nBetterLife 2614\nITperson 1742\nOffOn 2220\nSocialMedia 4446\nExtraversion 1578\nScreenName 8397\nSOVisit1st 5006\nSOVisitFreq 620\nSOVisitTo 797\nSOFindAnswer 1067\nSOTimeSaved 2539\nSOHowMuchTime 20505\nSOAccount 1055\nSOPartFreq 14191\nSOJobs 817\nEntTeams 1042\nSOComm 752\nWelcomeChange 3028\nSONewContent 19323\nAge 9673\nGender 3477\nTrans 5276\nSexuality 12736\nEthnicity 12215\nDependents 5824\nSurveyLength 1899\nSurveyEase 1802\nLength: 85, dtype: int64"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 2. Drop the values having \"NaN\" using the dropna() function"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dummy_encoding = dummy_encoding.dropna()",
"execution_count": 75,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dummy_encoding.isna().sum()",
"execution_count": 76,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 76,
"data": {
"text/plain": "Respondent 0\nMainBranch 0\nHobbyist 0\nOpenSourcer 0\nOpenSource 0\nEmployment 0\nCountry 0\nStudent 0\nEdLevel 0\nUndergradMajor 0\nEduOther 0\nOrgSize 0\nDevType 0\nYearsCode 0\nAge1stCode 0\nYearsCodePro 0\nCareerSat 0\nJobSat 0\nMgrIdiot 0\nMgrMoney 0\nMgrWant 0\nJobSeek 0\nLastHireDate 0\nLastInt 0\nFizzBuzz 0\nJobFactors 0\nResumeUpdate 0\nCurrencySymbol 0\nCurrencyDesc 0\nCompTotal 0\n ..\nContainers 0\nBlockchainOrg 0\nBlockchainIs 0\nBetterLife 0\nITperson 0\nOffOn 0\nSocialMedia 0\nExtraversion 0\nScreenName 0\nSOVisit1st 0\nSOVisitFreq 0\nSOVisitTo 0\nSOFindAnswer 0\nSOTimeSaved 0\nSOHowMuchTime 0\nSOAccount 0\nSOPartFreq 0\nSOJobs 0\nEntTeams 0\nSOComm 0\nWelcomeChange 0\nSONewContent 0\nAge 0\nGender 0\nTrans 0\nSexuality 0\nEthnicity 0\nDependents 0\nSurveyLength 0\nSurveyEase 0\nLength: 85, dtype: int64"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 3. Select all the non-numeric columns and create a new dataframe for it"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dummy_encoding_categorical = dummy_encoding.select_dtypes(exclude=[np.number]).columns",
"execution_count": 77,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dummy_encoding_categorical",
"execution_count": 78,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 78,
"data": {
"text/plain": "Index(['MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource', 'Employment',\n 'Country', 'Student', 'EdLevel', 'UndergradMajor', 'EduOther',\n 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode', 'YearsCodePro',\n 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney', 'MgrWant', 'JobSeek',\n 'LastHireDate', 'LastInt', 'FizzBuzz', 'JobFactors', 'ResumeUpdate',\n 'CurrencySymbol', 'CurrencyDesc', 'CompFreq', 'WorkPlan',\n 'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',\n 'UnitTests', 'PurchaseHow', 'PurchaseWhat', 'LanguageWorkedWith',\n 'LanguageDesireNextYear', 'DatabaseWorkedWith',\n 'DatabaseDesireNextYear', 'PlatformWorkedWith',\n 'PlatformDesireNextYear', 'WebFrameWorkedWith',\n 'WebFrameDesireNextYear', 'MiscTechWorkedWith',\n 'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',\n 'BlockchainOrg', 'BlockchainIs', 'BetterLife', 'ITperson', 'OffOn',\n 'SocialMedia', 'Extraversion', 'ScreenName', 'SOVisit1st',\n 'SOVisitFreq', 'SOVisitTo', 'SOFindAnswer', 'SOTimeSaved',\n 'SOHowMuchTime', 'SOAccount', 'SOPartFreq', 'SOJobs', 'EntTeams',\n 'SOComm', 'WelcomeChange', 'SONewContent', 'Gender', 'Trans',\n 'Sexuality', 'Ethnicity', 'Dependents', 'SurveyLength', 'SurveyEase'],\n dtype='object')"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dummy_encoding[dummy_encoding_categorical].head()",
"execution_count": 79,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 79,
"data": {
"text/plain": " MainBranch Hobbyist \\\n19 I am not primarily a developer, but I write co... No \n38 I am a developer by profession Yes \n43 I am a developer by profession Yes \n82 I am a developer by profession Yes \n103 I am a developer by profession Yes \n\n OpenSourcer \\\n19 Never \n38 Less than once per year \n43 Once a month or more often \n82 Less than once a month but more than once per ... \n103 Never \n\n OpenSource Employment \\\n19 OSS is, on average, of HIGHER quality than pro... Employed full-time \n38 The quality of OSS and closed source software ... Employed full-time \n43 The quality of OSS and closed source software ... Employed full-time \n82 OSS is, on average, of HIGHER quality than pro... Employed full-time \n103 OSS is, on average, of LOWER quality than prop... Employed full-time \n\n Country Student \\\n19 Lithuania No \n38 United States No \n43 Germany No \n82 India No \n103 India Yes, full-time \n\n EdLevel \\\n19 Master’s degree (MA, MS, M.Eng., MBA, etc.) \n38 Bachelor’s degree (BA, BS, B.Eng., etc.) \n43 Bachelor’s degree (BA, BS, B.Eng., etc.) \n82 Bachelor’s degree (BA, BS, B.Eng., etc.) \n103 Bachelor’s degree (BA, BS, B.Eng., etc.) \n\n UndergradMajor \\\n19 Information systems, information technology, o... \n38 Computer science, computer engineering, or sof... \n43 Information systems, information technology, o... \n82 Web development or web design \n103 Computer science, computer engineering, or sof... \n\n EduOther \\\n19 Taken an online course in programming or softw... \n38 Taken an online course in programming or softw... \n43 Taken an online course in programming or softw... \n82 Taken an online course in programming or softw... \n103 Taken a part-time in-person course in programm... \n\n ... SOComm \\\n19 ... Neutral \n38 ... Yes, definitely \n43 ... Yes, somewhat \n82 ... Neutral \n103 ... Yes, definitely \n\n WelcomeChange \\\n19 Not applicable - I did not use Stack Overflow ... 
\n38 Somewhat less welcome now than last year \n43 Just as welcome now as I felt last year \n82 Just as welcome now as I felt last year \n103 Just as welcome now as I felt last year \n\n SONewContent Gender Trans \\\n19 Tech articles written by other developers Man No \n38 Tech articles written by other developers Man No \n43 Tech articles written by other developers;Indu... Man No \n82 Industry news about technologies you're intere... Man No \n103 Tech articles written by other developers;Indu... Man No \n\n Sexuality Ethnicity Dependents \\\n19 Straight / Heterosexual White or of European descent Yes \n38 Bisexual White or of European descent No \n43 Straight / Heterosexual White or of European descent Yes \n82 Straight / Heterosexual South Asian No \n103 Straight / Heterosexual South Asian Yes \n\n SurveyLength SurveyEase \n19 Appropriate in length Easy \n38 Appropriate in length Easy \n43 Appropriate in length Easy \n82 Appropriate in length Neither easy nor difficult \n103 Appropriate in length Easy \n\n[5 rows x 79 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>MainBranch</th>\n <th>Hobbyist</th>\n <th>OpenSourcer</th>\n <th>OpenSource</th>\n <th>Employment</th>\n <th>Country</th>\n <th>Student</th>\n <th>EdLevel</th>\n <th>UndergradMajor</th>\n <th>EduOther</th>\n <th>...</th>\n <th>SOComm</th>\n <th>WelcomeChange</th>\n <th>SONewContent</th>\n <th>Gender</th>\n <th>Trans</th>\n <th>Sexuality</th>\n <th>Ethnicity</th>\n <th>Dependents</th>\n <th>SurveyLength</th>\n <th>SurveyEase</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>19</th>\n <td>I am not primarily a developer, but I write co...</td>\n <td>No</td>\n <td>Never</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>Lithuania</td>\n <td>No</td>\n <td>Master’s degree (MA, MS, M.Eng., MBA, etc.)</td>\n <td>Information systems, information technology, o...</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>Neutral</td>\n <td>Not applicable - I did not use Stack Overflow ...</td>\n <td>Tech articles written by other developers</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>38</th>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once per year</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>United States</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>Yes, definitely</td>\n 
<td>Somewhat less welcome now than last year</td>\n <td>Tech articles written by other developers</td>\n <td>Man</td>\n <td>No</td>\n <td>Bisexual</td>\n <td>White or of European descent</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>43</th>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Once a month or more often</td>\n <td>The quality of OSS and closed source software ...</td>\n <td>Employed full-time</td>\n <td>Germany</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Information systems, information technology, o...</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>Yes, somewhat</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>White or of European descent</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n <tr>\n <th>82</th>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Less than once a month but more than once per ...</td>\n <td>OSS is, on average, of HIGHER quality than pro...</td>\n <td>Employed full-time</td>\n <td>India</td>\n <td>No</td>\n <td>Bachelor’s degree (BA, BS, B.Eng., etc.)</td>\n <td>Web development or web design</td>\n <td>Taken an online course in programming or softw...</td>\n <td>...</td>\n <td>Neutral</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Industry news about technologies you're intere...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>South Asian</td>\n <td>No</td>\n <td>Appropriate in length</td>\n <td>Neither easy nor difficult</td>\n </tr>\n <tr>\n <th>103</th>\n <td>I am a developer by profession</td>\n <td>Yes</td>\n <td>Never</td>\n <td>OSS is, on average, of LOWER quality than prop...</td>\n <td>Employed full-time</td>\n <td>India</td>\n <td>Yes, full-time</td>\n <td>Bachelor’s degree 
(BA, BS, B.Eng., etc.)</td>\n <td>Computer science, computer engineering, or sof...</td>\n <td>Taken a part-time in-person course in programm...</td>\n <td>...</td>\n <td>Yes, definitely</td>\n <td>Just as welcome now as I felt last year</td>\n <td>Tech articles written by other developers;Indu...</td>\n <td>Man</td>\n <td>No</td>\n <td>Straight / Heterosexual</td>\n <td>South Asian</td>\n <td>Yes</td>\n <td>Appropriate in length</td>\n <td>Easy</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 79 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### 4. One-hot encoding creates a new column for every level (category) of a variable. To prefix each category name with the name of its source column, pass the column names to the `prefix` argument of `pd.get_dummies`:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# One-hot encode the categorical columns; each new column is named <source column>_<category>\ndummy_onehot_encoding = pd.get_dummies(dummy_encoding[dummy_encoding_categorical], prefix=dummy_encoding_categorical)",
"execution_count": 80,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dummy_onehot_encoding.head()",
"execution_count": 81,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 81,
"data": {
"text/plain": " MainBranch_I am a developer by profession \\\n19 0 \n38 1 \n43 1 \n82 1 \n103 1 \n\n MainBranch_I am not primarily a developer, but I write code sometimes as part of my work \\\n19 1 \n38 0 \n43 0 \n82 0 \n103 0 \n\n Hobbyist_No Hobbyist_Yes \\\n19 1 0 \n38 0 1 \n43 0 1 \n82 0 1 \n103 0 1 \n\n OpenSourcer_Less than once a month but more than once per year \\\n19 0 \n38 0 \n43 0 \n82 1 \n103 0 \n\n OpenSourcer_Less than once per year OpenSourcer_Never \\\n19 0 1 \n38 1 0 \n43 0 0 \n82 0 0 \n103 0 1 \n\n OpenSourcer_Once a month or more often \\\n19 0 \n38 0 \n43 1 \n82 0 \n103 0 \n\n OpenSource_OSS is, on average, of HIGHER quality than proprietary / closed source software \\\n19 1 \n38 0 \n43 0 \n82 1 \n103 0 \n\n OpenSource_OSS is, on average, of LOWER quality than proprietary / closed source software \\\n19 0 \n38 0 \n43 0 \n82 0 \n103 1 \n\n ... \\\n19 ... \n38 ... \n43 ... \n82 ... \n103 ... \n\n Ethnicity_White or of European descent;Biracial;Multiracial \\\n19 0 \n38 0 \n43 0 \n82 0 \n103 0 \n\n Ethnicity_White or of European descent;Multiracial Dependents_No \\\n19 0 0 \n38 0 1 \n43 0 0 \n82 0 1 \n103 0 0 \n\n Dependents_Yes SurveyLength_Appropriate in length \\\n19 1 1 \n38 0 1 \n43 1 1 \n82 0 1 \n103 1 1 \n\n SurveyLength_Too long SurveyLength_Too short SurveyEase_Difficult \\\n19 0 0 0 \n38 0 0 0 \n43 0 0 0 \n82 0 0 0 \n103 0 0 0 \n\n SurveyEase_Easy SurveyEase_Neither easy nor difficult \n19 1 0 \n38 1 0 \n43 1 0 \n82 0 1 \n103 1 0 \n\n[5 rows x 16051 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>MainBranch_I am a developer by profession</th>\n <th>MainBranch_I am not primarily a developer, but I write code sometimes as part of my work</th>\n <th>Hobbyist_No</th>\n <th>Hobbyist_Yes</th>\n <th>OpenSourcer_Less than once a month but more than once per year</th>\n <th>OpenSourcer_Less than once per year</th>\n <th>OpenSourcer_Never</th>\n <th>OpenSourcer_Once a month or more often</th>\n <th>OpenSource_OSS is, on average, of HIGHER quality than proprietary / closed source software</th>\n <th>OpenSource_OSS is, on average, of LOWER quality than proprietary / closed source software</th>\n <th>...</th>\n <th>Ethnicity_White or of European descent;Biracial;Multiracial</th>\n <th>Ethnicity_White or of European descent;Multiracial</th>\n <th>Dependents_No</th>\n <th>Dependents_Yes</th>\n <th>SurveyLength_Appropriate in length</th>\n <th>SurveyLength_Too long</th>\n <th>SurveyLength_Too short</th>\n <th>SurveyEase_Difficult</th>\n <th>SurveyEase_Easy</th>\n <th>SurveyEase_Neither easy nor difficult</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>19</th>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>38</th>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n 
<th>43</th>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>82</th>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>103</th>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 16051 columns</p>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "dummy_onehot_encoding.columns",
"execution_count": 82,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 82,
"data": {
"text/plain": "Index(['MainBranch_I am a developer by profession',\n 'MainBranch_I am not primarily a developer, but I write code sometimes as part of my work',\n 'Hobbyist_No', 'Hobbyist_Yes',\n 'OpenSourcer_Less than once a month but more than once per year',\n 'OpenSourcer_Less than once per year', 'OpenSourcer_Never',\n 'OpenSourcer_Once a month or more often',\n 'OpenSource_OSS is, on average, of HIGHER quality than proprietary / closed source software',\n 'OpenSource_OSS is, on average, of LOWER quality than proprietary / closed source software',\n ...\n 'Ethnicity_White or of European descent;Biracial;Multiracial',\n 'Ethnicity_White or of European descent;Multiracial', 'Dependents_No',\n 'Dependents_Yes', 'SurveyLength_Appropriate in length',\n 'SurveyLength_Too long', 'SurveyLength_Too short',\n 'SurveyEase_Difficult', 'SurveyEase_Easy',\n 'SurveyEase_Neither easy nor difficult'],\n dtype='object', length=16051)"
},
"metadata": {}
}
]
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.7.0",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "2e099b81190df040cd93286aa42f411d",
"data": {
"description": "Data Cleaning.ipynb",
"public": true
}
},
"_draft": {
"nbviewer_url": "https://gist.github.com/2e099b81190df040cd93286aa42f411d"
}
},
"nbformat": 4,
"nbformat_minor": 2
}