@RandomForestGump
Last active July 12, 2019 06:18
Gender_Classification
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, we determine, for a given author or authors, whether the portrayal of characters differs by gender. \n",
"\n",
"Our final task is to define a function that guesses the gender of a character from his or her name. \n",
"\n",
"The file `names.csv` contains several thousand male and female names, each with a frequency showing how common the name is for that gender. The data was obtained from an online source. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Novel used: Alice's Adventures in Wonderland"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"import os\n",
"from collections import Counter, defaultdict\n",
"\n",
"import pandas as pd\n",
"import spacy\n",
"from nltk.corpus import gutenberg\n",
"\n",
"# Assumption: the original does not show which spaCy model was loaded; any English pipeline should work\n",
"nlp = spacy.load('en_core_web_sm')\n",
"\n",
"\n",
"# Build a map from each lowercased name in names.csv to its most frequent gender\n",
"def create_gender_map(dict_reader):\n",
"    names_info = defaultdict(lambda: {\"gender\": \"\", \"freq\": 0.0})\n",
"    for row in dict_reader:\n",
"        name = row[\"name\"].lower()\n",
"        freq = float(row[\"freq\"])\n",
"        if names_info[name][\"freq\"] < freq:  # is this gender more frequent?\n",
"            names_info[name][\"gender\"] = row[\"gender\"]\n",
"            names_info[name][\"freq\"] = freq\n",
"    gender_map = defaultdict(lambda: \"unknown\")\n",
"    for name in names_info:\n",
"        gender_map[name] = names_info[name][\"gender\"]\n",
"    return gender_map\n",
"\n",
"\n",
"os.chdir('C://Users//cmm//Desktop')\n",
"\n",
"# Read names.csv and build the gender map; the file is closed once the map is built\n",
"with open('names.csv', newline='') as names_file:\n",
"    gender_map = create_gender_map(csv.DictReader(names_file))\n",
"\n",
"# Titles that indicate a male character\n",
"male_title = ['mr.', 'sir', 'monsieur', 'captain', 'chief', 'master', 'lord', 'baron', 'mister', 'mr', 'prince', 'king']\n",
"# Titles that indicate a female character\n",
"female_title = ['mrs.', 'ms.', 'miss', 'lady', 'mademoiselle', 'baroness', 'mistress', 'mrs', 'ms', 'queen', 'princess', 'madam', 'madame']\n",
"\n",
"\n",
"def gender_guess(name, gender_map):\n",
"    name_array = name.lower().split()\n",
"    if not name_array:\n",
"        return 'unknown'\n",
"    # Look up the first token in the names.csv database\n",
"    if name_array[0] in gender_map:\n",
"        return gender_map[name_array[0]]\n",
"    # Otherwise look for a gendered title anywhere in the name\n",
"    for title in name_array:\n",
"        if title in male_title:\n",
"            return 'male'\n",
"        if title in female_title:\n",
"            return 'female'\n",
"    return 'unknown'\n",
"\n",
"\n",
"def named_entity_counts(document, named_entity_label):\n",
"    # Return a Counter of the entities in the document that carry the given label\n",
"    occurrences = [ent.text.strip() for ent in document.ents\n",
"                   if ent.label_ == named_entity_label and ent.text.strip()]\n",
"    return Counter(occurrences)\n",
"\n",
"\n",
"# Parse Alice's Adventures in Wonderland by Lewis Carroll\n",
"alice = gutenberg.raw(fileids='carroll-alice.txt')\n",
"parsed_alice = nlp(alice)\n",
"entity_type = 'PERSON'    # named-entity label to count\n",
"number_of_entities = 10   # how many of the most frequent entities to keep\n",
"Entities = pd.DataFrame(named_entity_counts(parsed_alice, entity_type).most_common(number_of_entities),\n",
"                        columns=[\"Entity\", \"Count\"])\n",
"Entities['Pred_Gender'] = [gender_guess(char, gender_map) for char in Entities['Entity']]\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Entity</th>\n",
" <th>Count</th>\n",
" <th>Pred_Gender</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Alice</td>\n",
" <td>388</td>\n",
" <td>female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Gryphon</td>\n",
" <td>32</td>\n",
" <td>unknown</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Queen</td>\n",
" <td>27</td>\n",
" <td>female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Mouse</td>\n",
" <td>25</td>\n",
" <td>unknown</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Rabbit</td>\n",
" <td>20</td>\n",
" <td>unknown</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Bill</td>\n",
" <td>12</td>\n",
" <td>male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Dormouse</td>\n",
" <td>8</td>\n",
" <td>unknown</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Dinah</td>\n",
" <td>8</td>\n",
" <td>female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>William</td>\n",
" <td>5</td>\n",
" <td>male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Beau</td>\n",
" <td>4</td>\n",
" <td>male</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Entity Count Pred_Gender\n",
"0 Alice 388 female\n",
"1 Gryphon 32 unknown\n",
"2 Queen 27 female\n",
"3 Mouse 25 unknown\n",
"4 Rabbit 20 unknown\n",
"5 Bill 12 male\n",
"6 Dormouse 8 unknown\n",
"7 Dinah 8 female\n",
"8 William 5 male\n",
"9 Beau 4 male"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Entities"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluating the Gender classifier for different entities"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Male first names"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"male\n",
"male\n",
"male\n",
"male\n",
"male\n",
"male\n",
"male\n"
]
}
],
"source": [
"names=['harry','abdul','homer','gary','robert','wayne','lionel']\n",
"for name in names:\n",
" print(gender_guess(name,gender_map))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Female first names"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"female\n",
"female\n",
"female\n",
"female\n",
"female\n",
"female\n",
"female\n"
]
}
],
"source": [
"names=['martha','holly','nicole','catherine','ruth','april','christina']\n",
"for name in names:\n",
" print(gender_guess(name,gender_map))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### First and last names"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"female\n",
"female\n",
"male\n",
"male\n",
"male\n",
"male\n"
]
}
],
"source": [
"names=['Liz Lemon','Leslie Knope','jesus navas','Robert Lewandowski','Anthony Martial','Wesley Sneijder']\n",
"for name in names:\n",
" print(gender_guess(name,gender_map))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### For names with titles"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"male\n",
"female\n",
"male\n"
]
}
],
"source": [
"print(gender_guess('Sir Alex Ferguson',gender_map))\n",
"print(gender_guess('Lady McElroy',gender_map))\n",
"print(gender_guess('Friedrich Nietzsche',gender_map))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Names that cannot be resolved\n",
"\n",
"The function returns 'unknown' when it cannot determine the gender of an entity. This happens when the first name is not in the `names.csv` file and the name carries no gender-indicating title."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"unknown\n",
"unknown\n",
"unknown\n"
]
}
],
"source": [
"print(gender_guess('Liam Neeson',gender_map))\n",
"print(gender_guess('Mahatma Gandhi',gender_map))\n",
"print(gender_guess('Pricella McCartney',gender_map))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
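
The lookup logic in the notebook above (pick the more frequent gender for each name, then fall back to a title) can be tried without `names.csv`, spaCy, or NLTK. The following is a minimal sketch, not the notebook's exact code: the `name`/`gender`/`freq` column layout follows the notebook, but the sample rows, the shortened title sets, and the helper names `build_gender_map` and `guess_gender` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample rows in the same name/gender/freq layout as names.csv.
SAMPLE_CSV = """name,gender,freq
Alice,female,0.8
Leslie,female,0.6
Leslie,male,0.4
Robert,male,0.9
"""

# Shortened title sets, for illustration only.
MALE_TITLES = {"mr", "mr.", "sir", "lord", "king"}
FEMALE_TITLES = {"mrs", "mrs.", "ms", "ms.", "miss", "lady", "queen"}


def build_gender_map(rows):
    """Map each lowercased name to the gender with the highest frequency."""
    best = {}  # name -> (freq, gender)
    for row in rows:
        name = row["name"].lower()
        freq = float(row["freq"])
        if name not in best or freq > best[name][0]:
            best[name] = (freq, row["gender"])
    return {name: gender for name, (_, gender) in best.items()}


def guess_gender(name, gender_map):
    """Try the first token as a first name, then any token as a title."""
    tokens = name.lower().split()
    if not tokens:
        return "unknown"
    if tokens[0] in gender_map:
        return gender_map[tokens[0]]
    for token in tokens:
        if token in MALE_TITLES:
            return "male"
        if token in FEMALE_TITLES:
            return "female"
    return "unknown"


gender_map = build_gender_map(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for name in ["Alice", "Leslie Knope", "Sir Alex Ferguson", "Lady McElroy", "Gryphon"]:
    print(name, "->", guess_gender(name, gender_map))
```

Checking the first token before the titles mirrors the order used in the notebook, so a name such as "Leslie Knope" is resolved from the name table, while "Sir Alex Ferguson" is resolved from the title only because "sir" is not in the sample table.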