Last active
July 12, 2019 06:18
Gender_Classification
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In this section, we will determine,for a given author or authors, whether there are differences in portrail of characters based on gender. \n", | |
"\n", | |
"Our final task is to define a function that guesses the gender of a character based on his or her name. \n", | |
"\n", | |
"The file `names.csv` contains several thousand male and female names with information about how frequent each name is. This information is extracted online. " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Novel Used: Alice in the wonderland" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"## Function to create gender map from names.csv file\n", | |
"def create_gender_map(dict_reader):\n", | |
" names_info = defaultdict(lambda: {\"gender\":\"\", \"freq\": 0.0})\n", | |
" for row in input_file:\n", | |
" name = row[\"name\"].lower()\n", | |
" if names_info[name][\"freq\"] < float(row[\"freq\"]): # is this gender more frequent?\n", | |
" names_info[name][\"gender\"] = row[\"gender\"] \n", | |
" names_info[name][\"freq\"] = float(row[\"freq\"])\n", | |
" gender_map = defaultdict(lambda: \"unknown\")\n", | |
" for name in names_info:\n", | |
" gender_map[name] = names_info[name][\"gender\"]\n", | |
" return gender_map\n", | |
"\n", | |
"os.chdir('C://Users//cmm//Desktop')\n", | |
"\n", | |
"input_file = csv.DictReader(open('names.csv')) ## Importing our names.csv file\n", | |
"gender_map = create_gender_map(input_file) ## Import the gender map\n", | |
"#### Male homonyms\n", | |
"male_title=['mr.','sir','monsieur','captain','chief','master','lord','baron','mister','mr','prince','king']\n", | |
"#### Female homonyms\n", | |
"female_title=['mrs.','ms.','miss','lady','madameoiselle','baroness','mistress','mrs','ms','queen','princess','madam','madame']\n", | |
"\n", | |
"\n", | |
"def gender_guess(name,gender_map): #Identifying entries in the names.csv database#\n", | |
" if (len(name.split()))==1:\n", | |
" if name.lower() in gender_map.keys():\n", | |
" return gender_map[name]\n", | |
" else:\n", | |
" return('unknown')\n", | |
" \n", | |
" if(len(name.split()))>1: \n", | |
" name_array=name.lower().split()\n", | |
" if name_array[0] in gender_map.keys():\n", | |
" return gender_map[name_array[0]]\n", | |
" \n", | |
" \n", | |
" for title in name_array: #Recognising titles of entries# \n", | |
" if title in male_title:\n", | |
" return 'male'\n", | |
" elif title in female_title:\n", | |
" return 'female'\n", | |
" else: \n", | |
" return('unknown')\n", | |
" break\n", | |
" \n", | |
"def named_entity_counts(document,named_entity_label): \n", | |
" \n", | |
"## Function that outputs a Counter object of human entities found\n", | |
"\n", | |
" occurrences = [ent.string.strip() for ent in document.ents \n", | |
" if ent.label_ == named_entity_label and ent.string.strip()]\n", | |
" return Counter(occurrences)\n", | |
"\n", | |
"alice = gutenberg.raw(fileids='carroll-alice.txt')\n", | |
"parsed_alice=nlp(alice)\n", | |
"text = parsed_alice ### Parsing Alice in the wonderland by Lewis Carroll\n", | |
"entity_type = 'PERSON' ## Type of entry\n", | |
"number_of_entities = 10 ### Control over obtaining number of defined type entities\n", | |
"Entities=pd.DataFrame(named_entity_counts(text,entity_type).most_common(number_of_entities),columns=[\"Entity\",\"Count\"])\n", | |
"entity=[]\n", | |
"for char in Entities['Entity']:\n", | |
" entity.append(gender_guess(char.lower(),gender_map))\n", | |
"Entities['Pred_Gender']=entity\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Entity</th>\n", | |
" <th>Count</th>\n", | |
" <th>Pred_Gender</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Alice</td>\n", | |
" <td>388</td>\n", | |
" <td>female</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>Gryphon</td>\n", | |
" <td>32</td>\n", | |
" <td>unknown</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Queen</td>\n", | |
" <td>27</td>\n", | |
" <td>female</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>Mouse</td>\n", | |
" <td>25</td>\n", | |
" <td>unknown</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>Rabbit</td>\n", | |
" <td>20</td>\n", | |
" <td>unknown</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>Bill</td>\n", | |
" <td>12</td>\n", | |
" <td>male</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>Dormouse</td>\n", | |
" <td>8</td>\n", | |
" <td>unknown</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>Dinah</td>\n", | |
" <td>8</td>\n", | |
" <td>female</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>William</td>\n", | |
" <td>5</td>\n", | |
" <td>male</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>Beau</td>\n", | |
" <td>4</td>\n", | |
" <td>male</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Entity Count Pred_Gender\n", | |
"0 Alice 388 female\n", | |
"1 Gryphon 32 unknown\n", | |
"2 Queen 27 female\n", | |
"3 Mouse 25 unknown\n", | |
"4 Rabbit 20 unknown\n", | |
"5 Bill 12 male\n", | |
"6 Dormouse 8 unknown\n", | |
"7 Dinah 8 female\n", | |
"8 William 5 male\n", | |
"9 Beau 4 male" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"Entities" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Evaluating the Gender classifier for different entities" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Male type Entities" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"male\n", | |
"male\n", | |
"male\n", | |
"male\n", | |
"male\n", | |
"male\n", | |
"male\n" | |
] | |
} | |
], | |
"source": [ | |
"names=['harry','abdul','homer','gary','robert','wayne','lionel']\n", | |
"for name in names:\n", | |
" print(gender_guess(name,gender_map))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Female type entities" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"female\n", | |
"female\n", | |
"female\n", | |
"female\n", | |
"female\n", | |
"female\n", | |
"female\n" | |
] | |
} | |
], | |
"source": [ | |
"names=['martha','holly','nicole','catherine','ruth','april','christina']\n", | |
"for name in names:\n", | |
" print(gender_guess(name,gender_map))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### For first and last names given" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"female\n", | |
"female\n", | |
"male\n", | |
"male\n", | |
"male\n", | |
"male\n" | |
] | |
} | |
], | |
"source": [ | |
"names=['Liz Lemon','Leslie Knope','jesus navas','Robert Lewandowski','Anthony Martial','Wesley Sneijder']\n", | |
"for name in names:\n", | |
" print(gender_guess(name,gender_map))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### For names with titles" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"male\n", | |
"female\n", | |
"male\n" | |
] | |
} | |
], | |
"source": [ | |
"print(gender_guess('Sir Alex Ferguson',gender_map))\n", | |
"print(gender_guess('Lady McElroy',gender_map))\n", | |
"print(gender_guess('Friedrich Nietzsche',gender_map))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### The program returns 'unknown' if the gender of the entity can't be determined by the function created. This error is a result of the name not being in the names.csv folder or doesn't have a gender bisecting title attached to it" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"unknown\n", | |
"unknown\n", | |
"unknown\n" | |
] | |
} | |
], | |
"source": [ | |
"print(gender_guess('Liam Neeson',gender_map))\n", | |
"print(gender_guess('Mahatma Gandhi',gender_map))\n", | |
"print(gender_guess('Pricella McCartney',gender_map))" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.5" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 1 | |
} |
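The lookup logic in the notebook can be tried without `names.csv`, spaCy, or NLTK. The sketch below is a minimal, self-contained approximation: only the column names (`name`, `gender`, `freq`) come from the notebook, while `SAMPLE_CSV`, its rows and frequencies, the shortened title sets, and the helper names `build_gender_map` and `guess_gender` are hypothetical stand-ins chosen for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for names.csv. The column names match the notebook;
# the rows and frequency values are made up for this example.
SAMPLE_CSV = """name,gender,freq
Alice,female,0.25
Bill,male,0.30
Leslie,female,0.04
Leslie,male,0.02
"""

# Shortened title sets (assumption: a subset of the notebook's lists).
MALE_TITLES = {"mr.", "mr", "sir", "lord", "king"}
FEMALE_TITLES = {"mrs.", "mrs", "ms.", "ms", "miss", "lady", "queen"}


def build_gender_map(rows):
    """Map each lowercased name to the gender with the highest frequency."""
    best = defaultdict(lambda: ("unknown", 0.0))
    for row in rows:
        name = row["name"].lower()
        freq = float(row["freq"])
        if freq > best[name][1]:
            best[name] = (row["gender"], freq)
    # Return a plain dict so that lookups never insert new keys.
    return {name: gender for name, (gender, _) in best.items()}


def guess_gender(name, gender_map):
    """Try the first token as a first name, then fall back to titles."""
    tokens = name.lower().split()
    if tokens and tokens[0] in gender_map:
        return gender_map[tokens[0]]
    for token in tokens:
        if token in MALE_TITLES:
            return "male"
        if token in FEMALE_TITLES:
            return "female"
    return "unknown"


gender_map = build_gender_map(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for name in ["Alice", "Leslie Knope", "Sir Alex Ferguson", "Lady McElroy", "Gryphon"]:
    print(name, "->", guess_gender(name, gender_map))
```

Because the first-name lookup runs before the title check, a title only decides the result when the first token is not a known name. This mirrors the order used in the notebook.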