New, cleaner code to verify the script behind the [previous classification results](https://gist.github.com/mattiasostmar/05a3e6b4411acd0bb0f003b0ef49f4cc) of Jungian cognitive functions from blog texts.
{
"cells": [
{
"cell_type": "code",
"execution_count": 376,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import requests\n",
"import os\n",
"import operator\n",
"from sklearn.metrics import classification_report\n",
"import time"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read raw data file\n",
"A copy of the raw datafile can be found on [Open Science Framework](https://mfr.osf.io/render?url=https://osf.io/gyrc7/?action=download%26mode=render)."
]
},
{
"cell_type": "code",
"execution_count": 251,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows: 22588\n",
"nf 9028\n",
"nt 7836\n",
"sf 3035\n",
"st 2189\n",
"Name: actual_temp, dtype: int64\n",
"\n"
]
}
],
"source": [
"df_pickle_path = \"../../pickles/dataframe_survey_2018-01-23_enriched.pickle\"\n",
"indata = pd.read_pickle(df_pickle_path)\n",
"# remove non-English texts\n",
"indata = indata[indata.lang == \"en\"]\n",
"print(\"Number of rows: {}\".format(len(indata)))\n",
"print(indata.actual_temp.value_counts())\n",
"# print(indata.func.value_counts()) # dominant function for each myers-briggs typ, not used in this experiment\n",
"print()"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"1. Separera ut koden som delar upp data i träning och eval. Du kan med fördel även ta 1000 exempel av vardera klass från de exempel som inte valdes ut bland de 6000 train-eval, som hängslen + livrem. Alltså training (4200), eval (2100) och external (2000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Perceiving functions\n",
"## Filter out sensing (s) and intuition (n) sampes into perceiving samples dataset\n",
"Since we have 5224 (3035 + 2189) examples of s (sensing) as the smallest class and we want the two classes to be trained on fairly similar amount of text we go for dividing 5000 examples from both sensing (s) and intuition (n) into training, evaluation and external example sets. Even though we have 16 864 (9028 + 7836) examples of intuition (n).\n",
"\n",
"First we create two new columns separating perceiving (s/n) from judging (t/f)."
]
},
{
"cell_type": "code",
"execution_count": 309,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"n 16864\n",
"s 5224\n",
"Name: perc_func, dtype: int64"
]
},
"execution_count": 309,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"indata[\"perc_func\"] = indata.actual_temp.str.extract(\"(\\w)\\w\", expand=False)\n",
"indata.perc_func.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we sample 5000 from each class. We keep the column \"actual_temp\" e.g. st, sf, nt, nf to be able to sanity check the values in the perc_func column."
]
},
{
"cell_type": "code",
"execution_count": 310,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"n 5000\n",
"s 5000\n",
"Name: perc_func, dtype: int64"
]
},
"execution_count": 310,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"perc_samples = pd.concat([\n",
" indata[indata.perc_func == \"s\"].sample(5000, random_state=123456)[[\"text\",\"perc_func\",\"actual_temp\"]],\n",
" indata[indata.perc_func == \"n\"].sample(5000, random_state=123456)[[\"text\",\"perc_func\",\"actual_temp\"]]\n",
" ])\n",
"perc_samples.perc_func.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 311,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>perc_func</th>\n",
" <th>actual_temp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>8623</th>\n",
" <td>Sonny Jooooooooon INDEX ASK PAST THEME Sonny J...</td>\n",
" <td>s</td>\n",
" <td>sf</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11987</th>\n",
" <td>Log in | Tumblr Sign up Terms Privacy Posted b...</td>\n",
" <td>s</td>\n",
" <td>st</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text perc_func actual_temp\n",
"8623 Sonny Jooooooooon INDEX ASK PAST THEME Sonny J... s sf\n",
"11987 Log in | Tumblr Sign up Terms Privacy Posted b... s st"
]
},
"execution_count": 311,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"perc_samples.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 312,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>perc_func</th>\n",
" <th>actual_temp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1034</th>\n",
" <td>A small dose of life. A small dose of life. A ...</td>\n",
" <td>n</td>\n",
" <td>nf</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6086</th>\n",
" <td>ehhhhh whatever ehhhhh whatever whining about ...</td>\n",
" <td>n</td>\n",
" <td>nf</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text perc_func actual_temp\n",
"1034 A small dose of life. A small dose of life. A ... n nf\n",
"6086 ehhhhh whatever ehhhhh whatever whining about ... n nf"
]
},
"execution_count": 312,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"perc_samples.tail(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extra check 1: Assert no overlap present in the indecies for s and n examples already"
]
},
{
"cell_type": "code",
"execution_count": 313,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10000"
]
},
"execution_count": 313,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Using the index method unique(), expected value 10000\n",
"len(perc_samples.index.unique())"
]
},
{
"cell_type": "code",
"execution_count": 314,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 314,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Using the index method intersection, expected value 0\n",
"len( s_samples.index.intersection(n_samples.index) ) "
]
},
{
"cell_type": "code",
"execution_count": 317,
"metadata": {},
"outputs": [],
"source": [
"# Store the perceiving samples in separate file, preserving the orginal index values in 'origIx'\n",
"perc_samples.to_csv(\"perceiving_samples_n10000.csv\", sep=\";\", index=True, index_label=\"origIx\")"
]
},
{
"cell_type": "code",
"execution_count": 326,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 8623, 11987, 5340, ..., 5822, 1034, 6086])"
]
},
"execution_count": 326,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"perc_samples_test.origIx.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extra check 2: Inspect origIx with perc_samples.index"
]
},
{
"cell_type": "code",
"execution_count": 328,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unique ix: 10000\n",
"Unique origIx: 10000\n"
]
}
],
"source": [
"# Read in the file to make sure ix and origIx looks reasonable\n",
"perc_samples_test = pd.read_csv(\"perceiving_samples_n10000.csv\", sep=\";\")\n",
"print(\"Unique ix: {}\".format(len(perc_samples_test.index.unique())))\n",
"print(\"Unique origIx: {}\".format(len(perc_samples_test.origIx.unique())))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Something seems to be wrong below"
]
},
{
"cell_type": "code",
"execution_count": 333,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3738"
]
},
"execution_count": 333,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Intersection of values in samples ix and original indicies. Expected value 0.\n",
"len(perc_samples_test.index.intersection(perc_samples_test.origIx))"
]
},
{
"cell_type": "code",
"execution_count": 350,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3738"
]
},
"execution_count": 350,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Adding .values to origIx just to explore if there is something peculiar with how Pandas uses .intersection\n",
"len(perc_samples_test.index.intersection(perc_samples_test.origIx.values))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why is it 3738? Shouldn't it be all 10000 - or 0 if I confused intersection with union? \n",
"\n",
"Could it be the way index.intersection works?"
]
},
{
"cell_type": "code",
"execution_count": 351,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0 1 2]\n",
"[0 1 2]\n",
"3\n",
"1\n",
"0\n"
]
}
],
"source": [
"s1 = pd.Series([1,2,3])\n",
"s2 = pd.Series([2,3,4])\n",
"s3 = pd.Series([3,4,5])\n",
"print(s1.index.values)\n",
"print(s2.index.values)\n",
"print(len(s1.index.intersection(s2.index)))\n",
"print(len(s1.index.intersection(s2.values))) # Expected value 1 since 2 is also in s1.index.values\n",
"print(len(s1.index.intersection(s3.values))) # Expected value 0 since no values in [3,4,5] are in [0,1,2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nope, index.intersection behaves as expected.\n",
"\n",
"Or is it in perc_samples.ix?"
]
},
{
"cell_type": "code",
"execution_count": 355,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Int64Index([ 8623, 11987, 5340, 23420, 18909, 7557, 17158, 20508, 4604,\n",
" 14753,\n",
" ...\n",
" 8081, 13099, 16591, 21953, 14180, 4113, 24786, 5822, 1034,\n",
" 6086],\n",
" dtype='int64', length=10000)"
]
},
"execution_count": 355,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"perc_samples.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yes. There it was. The indicies in perc_samples are in *their* turn from indata.index which has length 22588."
]
},
{
"cell_type": "code",
"execution_count": 357,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Int64Index([ 1, 2, 3, 5, 10, 11, 14, 15, 16,\n",
" 17,\n",
" ...\n",
" 25425, 25426, 25428, 25429, 25430, 25431, 25432, 25435, 25436,\n",
" 25437],\n",
" dtype='int64', length=22588)"
]
},
"execution_count": 357,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"indata.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Maybe we should reset index for perc_samples?\n",
"\n",
"**No.** As long as origIx is sure not to have duplicates we can still ensure that training and evaluation datasets don't overlap."
]
},
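{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make that claim explicit, a quick uniqueness check (a sketch, not part of the original run):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: origIx comes from perc_samples.index, so both checks should pass if the sampling is correct\n",
"assert perc_samples.index.is_unique, \"origIx would contain duplicates\"\n",
"assert len(perc_samples) == 10000"
]
},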
{
"cell_type": "raw",
"metadata": {},
"source": [
"2. Spara ner dessa tre dataset som csv. Detta blir den “officiella” uppdelningen i olika dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Separate training, evaluation and external data and store it to separate files\n",
"Select 2100 examples for training, 900 for evaluation and 2000 for an extra possibility to check.\n",
"\n",
"Also store original index number to be able to **ensure that training and evaluation examples do not overlap**."
]
},
{
"cell_type": "code",
"execution_count": 358,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sensing total: 5000\n",
"training: 2100\n",
"evaluation: 900\n",
"external: 2000\n",
"\n",
"Intuition total: 5000\n",
"training: 2100\n",
"evaluation: 900\n",
"external: 2000\n"
]
}
],
"source": [
"# sensing \n",
"s_examples = perc_samples[perc_samples.perc_func == \"s\"]\n",
"print(\"Sensing total: {}\".format(len(s_examples)))\n",
"\n",
"s_training = s_examples.iloc[:2100] # select first 2100\n",
"print(\"training: {}\".format(len(s_training)))\n",
"\n",
"s_evaluation = s_examples.iloc[2100:3000] # select next 900\n",
"print(\"evaluation: {}\".format(len(s_evaluation)))\n",
"\n",
"s_external = s_examples.iloc[3000:5000] # select remaining 2000\n",
"print(\"external: {}\".format(len(s_external)))\n",
"print()\n",
"\n",
"# intution\n",
"n_examples = perc_samples[perc_samples.perc_func == \"n\"]\n",
"print(\"Intuition total: {}\".format(len(s_examples)))\n",
"\n",
"n_training = n_examples.iloc[:2100] # select first 2100\n",
"print(\"training: {}\".format(len(s_training)))\n",
"\n",
"n_evaluation = n_examples.iloc[2100:3000] # select next 900\n",
"print(\"evaluation: {}\".format(len(s_evaluation)))\n",
"\n",
"n_external = n_examples.iloc[3000:5001] # select remaining 2000\n",
"print(\"external: {}\".format(len(s_external)))\n",
"\n",
"# Combine sensing + intution to creade perceiving classification training, evaluation and external datasets\n",
"perc_training = pd.concat([s_training, n_training])\n",
"perc_evaluation = pd.concat([s_evaluation, n_evaluation])\n",
"perc_external = pd.concat([s_external, n_external])\n",
"\n",
"# Store each dataset to separate CSV-files\n",
"perc_training.to_csv(\"perc_trainingdata_n4200.csv\", sep=\";\", index=True, index_label=\"origIx\")\n",
"perc_evaluation.to_csv(\"perc_evaluationdata_n1800.csv\", sep=\";\", index=True, index_label=\"origIx\")\n",
"perc_external.to_csv(\"perc_externaldata_n4000.csv\", sep=\";\", index=True, index_label=\"origIx\")"
]
},
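{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick pairwise check (a sketch, not part of the original run) that the three splits are disjoint on their original indices:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: assert that no original index appears in more than one of the three datasets\n",
"pairs = [(perc_training, perc_evaluation), (perc_training, perc_external), (perc_evaluation, perc_external)]\n",
"for a, b in pairs:\n",
"    assert len(a.index.intersection(b.index)) == 0, \"datasets overlap\""
]
},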
{
"cell_type": "raw",
"metadata": {},
"source": [
"3. Läs in training från fil och träna klassificerare\n",
"3b. Assert:a att det inte finns ngt överlapp i index mellan training och eval"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train uClassify perceiving classifier \n",
"Make sure to load the training data from the correct file containing training data *only*."
]
},
{
"cell_type": "code",
"execution_count": 259,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training data rows: 4200\n",
"Training top 5 rows: \n",
" origIx text perc_func \\\n",
"0 8623 Sonny Jooooooooon INDEX ASK PAST THEME Sonny J... s \n",
"1 11987 Log in | Tumblr Sign up Terms Privacy Posted b... s \n",
"2 5340 a thing of blood © hi im logan and i love the ... s \n",
"3 23420 Nobody can be uncheered with a baloon (^ v ^) ... s \n",
"4 18909 Wit Beyond Measure Wit Beyond Measure Aug 14, ... s \n",
"\n",
" actual_temp \n",
"0 sf \n",
"1 st \n",
"2 sf \n",
"3 sf \n",
"4 st \n",
"\n"
]
}
],
"source": [
"training = pd.read_csv(\"perc_trainingdata_n4200.csv\", sep=\";\")\n",
"print(\"Training data rows: {}\".format(len(trainingdata)))\n",
"print(\"Training top 5 rows: \\n{}\".format(trainingdata.head(5)))\n",
"print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assert no overlap between training and evaluation data"
]
},
{
"cell_type": "code",
"execution_count": 359,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Evaluationset is subset of trainingset (element-wise): False\n",
"The length of the union of training set ix and evaluation set ix: 6000\n",
"The lenght of the intersection of training set and evaluation set: 0\n"
]
}
],
"source": [
"trainingset = set(training.origIx.values)\n",
"evaluation = pd.read_csv(\"perc_evaluationdata_n1800.csv\", sep=\";\", )\n",
"evaluationset = set(evaluation.origIx.values)\n",
"print(\"Evaluationset is subset of trainingset (element-wise): {}\".format(trainingset.issubset(evaluationset)))\n",
"\n",
"# Expected union of training and evaluation is 4200 + 1800 = 6000\n",
"print(\"The length of the union of training set ix and evaluation set ix: {}\".format(len(trainingset.union(evaluationset))))\n",
"\n",
"# Expected intersection of training and evaluation is 0\n",
"print(\"The lenght of the intersection of training set and evaluation set: {}\".format(len(trainingset.intersection(evaluationset))))\n"
]
},
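{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 3b of the plan calls for an actual assert rather than a printout. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: fail loudly if any origIx appears in both training and evaluation\n",
"assert trainingset.isdisjoint(evaluationset), \"training and evaluation sets overlap\""
]
},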
{
"cell_type": "code",
"execution_count": 260,
"metadata": {},
"outputs": [],
"source": [
"def train_jung_cognitive_functions_en_classes(func, classifier):\n",
" \"\"\"Presupposes that classifier is created and that setup_jung_functions_en_classes() is already run.\n",
" func: expects one of [\"s\",\"n\",\"t\",\"f\"]\n",
" classifier: expects on of [\"sntf\", \"tf\", \"sn\"]\n",
" \n",
" \"\"\"\n",
" if classifier == \"sn\":\n",
" \n",
" data = {\"texts\":[row[\"text\"]]}\n",
" header = {\"Content-Type\": \"application/json\",\n",
" \"Authorization\": \"Token \" + os.environ[\"UCLASSIFY_WRITE\"]}\n",
" \n",
" try:\n",
" response = requests.post('https://api.uclassify.com/v1/me/jung-perceiving-verification-20180321-no2/' + func + \"/train\", \n",
" json = data,\n",
" headers = header)\n",
" except Exception as e:\n",
" print(\"Error: {}. retrying in 3 minutes.\")\n",
" time.sleep(180)\n",
" response = requests.post('https://api.uclassify.com/v1/me/jung-perceiving-verification-20180321-no2/' + func + \"/train\", \n",
" json = data,\n",
" headers = header)\n",
" \n",
" elif classifier == \"tf\":\n",
" \n",
" data = {\"texts\":[row[\"text\"]]}\n",
" header = {\"Content-Type\": \"application/json\",\n",
" \"Authorization\": \"Token \" + os.environ[\"UCLASSIFY_WRITE\"]}\n",
" \n",
" try:\n",
" response = requests.post('https://api.uclassify.com/v1/me/jung-judging-verification-20180321-no2/' + func + \"/train\", \n",
" json = data,\n",
" headers = header)\n",
" except Exception as e:\n",
" print(\"Error: {}. retrying in 3 minutes.\")\n",
" time.sleep(180)\n",
" response = requests.post('https://api.uclassify.com/v1/me/jung-judging-verification-20180321-no2/' + func + \"/train\", \n",
" json = data,\n",
" headers = header)"
]
},
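{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two try/except branches above duplicate the retry logic. A small helper could factor it out (a sketch only; `post_with_retry` is a hypothetical name and was not used in the runs above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def post_with_retry(url, json_data, headers, wait_seconds=180):\n",
"    \"\"\"POST to url, retrying once after wait_seconds if a connection error occurs.\"\"\"\n",
"    try:\n",
"        return requests.post(url, json=json_data, headers=headers)\n",
"    except Exception as e:\n",
"        print(\"Error: {}. Retrying in {} seconds.\".format(e, wait_seconds))\n",
"        time.sleep(wait_seconds)\n",
"        return requests.post(url, json=json_data, headers=headers)"
]
},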
{
"cell_type": "code",
"execution_count": 261,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Row 100 of 4200 trained.\n",
"Row 200 of 4200 trained.\n",
"Row 300 of 4200 trained.\n",
"Row 400 of 4200 trained.\n",
"Row 500 of 4200 trained.\n",
"Row 600 of 4200 trained.\n",
"Row 700 of 4200 trained.\n",
"Row 800 of 4200 trained.\n",
"Row 900 of 4200 trained.\n",
"Row 1000 of 4200 trained.\n",
"Row 1100 of 4200 trained.\n",
"Row 1200 of 4200 trained.\n",
"Row 1300 of 4200 trained.\n",
"Row 1400 of 4200 trained.\n",
"Row 1500 of 4200 trained.\n",
"Row 1600 of 4200 trained.\n",
"Row 1700 of 4200 trained.\n",
"Row 1800 of 4200 trained.\n",
"Row 1900 of 4200 trained.\n",
"Row 2000 of 4200 trained.\n",
"Row 2100 of 4200 trained.\n",
"Row 2200 of 4200 trained.\n",
"Row 2300 of 4200 trained.\n",
"Row 2400 of 4200 trained.\n",
"Row 2500 of 4200 trained.\n",
"Row 2600 of 4200 trained.\n",
"Row 2700 of 4200 trained.\n",
"Row 2800 of 4200 trained.\n",
"Row 2900 of 4200 trained.\n",
"Row 3000 of 4200 trained.\n",
"Row 3100 of 4200 trained.\n",
"Row 3200 of 4200 trained.\n",
"Row 3300 of 4200 trained.\n",
"Row 3400 of 4200 trained.\n",
"Row 3500 of 4200 trained.\n",
"Row 3600 of 4200 trained.\n",
"Row 3700 of 4200 trained.\n",
"Row 3800 of 4200 trained.\n",
"Row 3900 of 4200 trained.\n",
"Row 4000 of 4200 trained.\n",
"Row 4100 of 4200 trained.\n",
"Row 4200 of 4200 trained.\n"
]
}
],
"source": [
"row_cnt = 1\n",
"for ix, row in trainingdata.iterrows():\n",
" train_jung_cognitive_functions_en_classes(func=row[\"perc_func\"], classifier=\"sn\")\n",
" if row_cnt % 100 == 0:\n",
" print(\"Row {} of {} trained.\".format(row_cnt, len(trainingdata)))\n",
" row_cnt += 1"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"4. Läs in eval från fil och utvärdera"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate perceiving classifier on un-seen data\n",
"Make sure to read in the correct file containing evaluation data *only*."
]
},
{
"cell_type": "code",
"execution_count": 360,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 1800 entries, 0 to 1799\n",
"Data columns (total 4 columns):\n",
"origIx 1800 non-null int64\n",
"text 1800 non-null object\n",
"perc_func 1800 non-null object\n",
"actual_temp 1800 non-null object\n",
"dtypes: int64(1), object(3)\n",
"memory usage: 56.3+ KB\n",
"None\n"
]
}
],
"source": [
"evaluation = pd.read_csv(\"perc_evaluationdata_n1800.csv\", sep=\";\", )\n",
"print(evaluation.info())"
]
},
{
"cell_type": "code",
"execution_count": 361,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>origIx</th>\n",
" <th>text</th>\n",
" <th>perc_func</th>\n",
" <th>actual_temp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>954</td>\n",
" <td>Dizzy With Enchantments Dizzy With Enchantment...</td>\n",
" <td>s</td>\n",
" <td>sf</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>15360</td>\n",
" <td>You Drool When You Sleep Create Destroy Ask Th...</td>\n",
" <td>s</td>\n",
" <td>st</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>18179</td>\n",
" <td>Lazyrainbow562 Lazyrainbow562 Deviantart Art B...</td>\n",
" <td>s</td>\n",
" <td>sf</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" origIx text perc_func \\\n",
"0 954 Dizzy With Enchantments Dizzy With Enchantment... s \n",
"1 15360 You Drool When You Sleep Create Destroy Ask Th... s \n",
"2 18179 Lazyrainbow562 Lazyrainbow562 Deviantart Art B... s \n",
"\n",
" actual_temp \n",
"0 sf \n",
"1 st \n",
"2 sf "
]
},
"execution_count": 361,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"evaluation.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 362,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>origIx</th>\n",
" <th>text</th>\n",
" <th>perc_func</th>\n",
" <th>actual_temp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1797</th>\n",
" <td>4703</td>\n",
" <td>I think I saw you in my sleep Lee | 17 | Canad...</td>\n",
" <td>n</td>\n",
" <td>nf</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1798</th>\n",
" <td>11842</td>\n",
" <td>Wow, Fantastic Baby Archive Ask Submit Wow, Fa...</td>\n",
" <td>n</td>\n",
" <td>nf</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1799</th>\n",
" <td>14711</td>\n",
" <td>A Social Pariah A Social Pariah Elizabeth - 16</td>\n",
" <td>n</td>\n",
" <td>nt</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" origIx text perc_func \\\n",
"1797 4703 I think I saw you in my sleep Lee | 17 | Canad... n \n",
"1798 11842 Wow, Fantastic Baby Archive Ask Submit Wow, Fa... n \n",
"1799 14711 A Social Pariah A Social Pariah Elizabeth - 16 n \n",
"\n",
" actual_temp \n",
"1797 nf \n",
"1798 nf \n",
"1799 nt "
]
},
"execution_count": 362,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"evaluation.tail(3)"
]
},
{
"cell_type": "code",
"execution_count": 265,
"metadata": {},
"outputs": [],
"source": [
"def classify_jung_percieving_function_of_text(text):\n",
" \"\"\"Does what it says, pretty much.\"\"\"\n",
" header = {\"Content-Type\": \"application/json\",\n",
" \"Authorization\": \"Token \" + os.environ[\"UCLASSIFY_READ\"]}\n",
" data = {\"texts\":[text]} # send a one-item list for now, since we don't have a feel for sizes\n",
" try:\n",
" result = requests.post(\"https://api.uclassify.com/v1/prfekt/jung-perceiving-verification-20180321-no2/classify\",\n",
" json = data,\n",
" headers = header)\n",
" except Exception as e:\n",
" print(\"Error connecting with uClassify. Retrying in 3 minutes.\")\n",
" time.sleep(180)\n",
" result = requests.post(\"https://api.uclassify.com/v1/prfekt/jung-perceiving-verification-20180321-no2/classify\",\n",
" json = data,\n",
" headers = header)\n",
" \n",
" json_result = result.json()\n",
" \n",
" res_dict = {\"s\":0, \"n\":0}\n",
" \n",
" for classItem in json_result[0][\"classification\"]:\n",
" res_dict[classItem[\"className\"]] = classItem[\"p\"]\n",
" \n",
" sorted_dict = sorted(res_dict.items(), key=operator.itemgetter(1), reverse=True)\n",
" return sorted_dict"
]
},
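{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a mock of the response shape the parser above assumes (field names taken from the parsing code, probability values illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mock response in the shape classify_jung_perceiving_function_of_text expects (illustrative values)\n",
"mock_json_result = [{\"classification\": [{\"className\": \"n\", \"p\": 0.53}, {\"className\": \"s\", \"p\": 0.47}]}]\n",
"mock_res = {item[\"className\"]: item[\"p\"] for item in mock_json_result[0][\"classification\"]}\n",
"sorted(mock_res.items(), key=operator.itemgetter(1), reverse=True)  # -> [('n', 0.53), ('s', 0.47)]"
]
},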
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's classify the text in each row in the evaluationdata and store the best classification result in a list."
]
},
{
"cell_type": "code",
"execution_count": 266,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Row 100 of 1800 classified.\n",
"Row 200 of 1800 classified.\n",
"Row 300 of 1800 classified.\n",
"Row 400 of 1800 classified.\n",
"Row 500 of 1800 classified.\n",
"Row 600 of 1800 classified.\n",
"Row 700 of 1800 classified.\n",
"Row 800 of 1800 classified.\n",
"Row 900 of 1800 classified.\n",
"Row 1000 of 1800 classified.\n",
"Row 1100 of 1800 classified.\n",
"Row 1200 of 1800 classified.\n",
"Row 1300 of 1800 classified.\n",
"Row 1400 of 1800 classified.\n",
"Row 1500 of 1800 classified.\n",
"Row 1600 of 1800 classified.\n",
"Row 1700 of 1800 classified.\n",
"Row 1800 of 1800 classified.\n"
]
}
],
"source": [
"sn_results = []\n",
"row_cnt = 1\n",
"for ix, row in evaluation.iterrows():\n",
" # The function returns a sorted list of tuples, max class first e.g. [('n', 0.528311), ('s'. 0.471689)]\n",
" res = classify_jung_percieving_function_of_text(row[\"text\"])\n",
" sn_results.append(res[0][0])\n",
" if row_cnt % 100 == 0:\n",
" print(\"Row {} of {} classified.\".format(row_cnt, len(evaluation)))\n",
" row_cnt += 1 "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we add the classification results as a separate column named \"uClassify\" to the evaluation DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 363,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>origIx</th>\n",
" <th>text</th>\n",
" <th>perc_func</th>\n",
" <th>actual_temp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>954</td>\n",
" <td>Dizzy With Enchantments Dizzy With Enchantment...</td>\n",
" <td>s</td>\n",
" <td>sf</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>15360</td>\n",
" <td>You Drool When You Sleep Create Destroy Ask Th...</td>\n",
" <td>s</td>\n",
" <td>st</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" origIx text perc_func \\\n",
"0 954 Dizzy With Enchantments Dizzy With Enchantment... s \n",
"1 15360 You Drool When You Sleep Create Destroy Ask Th... s \n",
"\n",
" actual_temp \n",
"0 sf \n",
"1 st "
]
},
"execution_count": 363,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"evaluation.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 366,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>origIx</th>\n",
" <th>text</th>\n",
" <th>perc_func</th>\n",
" <th>actual_temp</th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>954</td>\n",
" <td>Dizzy With Enchantments Dizzy With Enchantment...</td>\n",
" <td>s</td>\n",
" <td>sf</td>\n",
" <td>n</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>15360</td>\n",
" <td>You Drool When You Sleep Create Destroy Ask Th...</td>\n",
" <td>s</td>\n",
" <td>st</td>\n",
" <td>n</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>18179</td>\n",
" <td>Lazyrainbow562 Lazyrainbow562 Deviantart Art B...</td>\n",
" <td>s</td>\n",
" <td>sf</td>\n",
" <td>s</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" origIx text perc_func \\\n",
"0 954 Dizzy With Enchantments Dizzy With Enchantment... s \n",
"1 15360 You Drool When You Sleep Create Destroy Ask Th... s \n",
"2 18179 Lazyrainbow562 Lazyrainbow562 Deviantart Art B... s \n",
"\n",
" actual_temp 0 \n",
"0 sf n \n",
"1 st n \n",
"2 sf s "
]
},
"execution_count": 366,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"classified_evaluation = pd.concat([evaluation, pd.DataFrame(sn_results, index=evaluation.index)],\n",
" axis=1)\n",
"classified_evaluation.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 368,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>origIx</th>\n",
" <th>text</th>\n",
" <th>perc_func</th>\n",
" <th>actual_temp</th>\n",
" <th>uClassify</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>954</td>\n",
" <td>Dizzy With Enchantments Dizzy With Enchantment...</td>\n",
" <td>s</td>\n",
" <td>sf</td>\n",
" <td>n</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>15360</td>\n",
" <td>You Drool When You Sleep Create Destroy Ask Th...</td>\n",
" <td>s</td>\n",
" <td>st</td>\n",
" <td>n</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>18179</td>\n",
" <td>Lazyrainbow562 Lazyrainbow562 Deviantart Art B...</td>\n",
" <td>s</td>\n",
" <td>sf</td>\n",
" <td>s</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" origIx text perc_func \\\n",
"0 954 Dizzy With Enchantments Dizzy With Enchantment... s \n",
"1 15360 You Drool When You Sleep Create Destroy Ask Th... s \n",
"2 18179 Lazyrainbow562 Lazyrainbow562 Deviantart Art B... s \n",
"\n",
" actual_temp uClassify \n",
"0 sf n \n",
"1 st n \n",
"2 sf s "
]
},
"execution_count": 368,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fix the missing column name. This behaviour is related to pd.concat and might change in Pandas 0.23.0\n",
"classified_evaluation.columns = [\"origIx\",\"text\",\"perc_func\",\"actual_temp\",\"uClassify\"]\n",
"classified_evaluation.head(3)"
]
},
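{
"cell_type": "markdown",
"metadata": {},
"source": [
"A simpler alternative (a sketch; not what was run above) avoids the unnamed `0` column from `pd.concat` by assigning the prediction list directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: assign the predictions as a named column instead of concatenating and renaming\n",
"classified_evaluation_alt = evaluation.copy()\n",
"classified_evaluation_alt[\"uClassify\"] = sn_results"
]
},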
{
"cell_type": "code",
"execution_count": 372,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>origIx</th>\n",
" <th>text</th>\n",
" <th>perc_func</th>\n",
" <th>actual_temp</th>\n",
" <th>uClassify</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1797</th>\n",
" <td>4703</td>\n",
" <td>I think I saw you in my sleep Lee | 17 | Canad...</td>\n",
" <td>n</td>\n",
" <td>nf</td>\n",
" <td>s</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1798</th>\n",
" <td>11842</td>\n",
" <td>Wow, Fantastic Baby Archive Ask Submit Wow, Fa...</td>\n",
" <td>n</td>\n",
" <td>nf</td>\n",
" <td>n</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1799</th>\n",
" <td>14711</td>\n",
" <td>A Social Pariah A Social Pariah Elizabeth - 16</td>\n",
" <td>n</td>\n",
" <td>nt</td>\n",
" <td>n</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" origIx text perc_func \\\n",
"1797 4703 I think I saw you in my sleep Lee | 17 | Canad... n \n",
"1798 11842 Wow, Fantastic Baby Archive Ask Submit Wow, Fa... n \n",
"1799 14711 A Social Pariah A Social Pariah Elizabeth - 16 n \n",
"\n",
" actual_temp uClassify \n",
"1797 nf s \n",
"1798 nf n \n",
"1799 nt n "
]
},
"execution_count": 372,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Just to see if anything looks funny\n",
"classified_evaluation.tail(3)"
]
},
{
"cell_type": "code",
"execution_count": 375,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.5594444444444444\n"
]
}
],
"source": [
"sn_accuracy = sum(classified_evaluation[\"perc_func\"]==classified_evaluation[\"uClassify\"])/len(classified_evaluation)\n",
"print(\"Accuracy: {}\".format(sn_accuracy))"
]
},
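{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since sklearn.metrics is already imported, the figure can be cross-checked with `accuracy_score` (a sketch, not part of the original run):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score\n",
"# Should match the manually computed sn_accuracy above\n",
"accuracy_score(classified_evaluation[\"perc_func\"], classified_evaluation[\"uClassify\"])"
]
},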
{
"cell_type": "code",
"execution_count": 371,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" n 0.57 0.51 0.54 900\n",
" s 0.55 0.61 0.58 900\n",
"\n",
"avg / total 0.56 0.56 0.56 1800\n",
"\n"
]
}
],
"source": [
"sn_cr = classification_report(classified_evaluation[\"perc_func\"], classified_evaluation[\"uClassify\"])\n",
"print(sn_cr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" # Results from 3 separate runs of this notebook"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"First run:\n",
"Using https://api.uclassify.com/v1/prfekt/jungian-cognitive-function-sensing-intuition/classify: # by mistake\n",
"Accuracy: 0.8727777777777778\n",
"precision recall f1-score support\n",
"\n",
" n 0.88 0.87 0.87 900\n",
" s 0.87 0.88 0.87 900\n",
"\n",
"avg / total 0.87 0.87 0.87 1800\n",
"\n",
"\n",
"Second run:\n",
"Using https://api.uclassify.com/v1/prfekt/jung-perceiving-verification-20180321/classify\n",
"Accuracy: 0.5594444444444444\n",
"precision recall f1-score support\n",
"\n",
" n 0.57 0.51 0.54 900\n",
" s 0.55 0.61 0.58 900\n",
"\n",
"avg / total 0.56 0.56 0.56 1800\n",
"\n",
"Third run:\n",
"Using https://api.uclassify.com/v1/prfekt/jung-perceiving-verification-20180321-no2/classify\n",
"Accuracy: 0.5594444444444444\n",
"precision recall f1-score support\n",
"\n",
" n 0.57 0.51 0.54 900\n",
" s 0.55 0.61 0.58 900\n",
"\n",
"avg / total 0.56 0.56 0.56 1800\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ahaaaa. The experiment was nitially ran with the wrong classifier url in line 7 in in[265]. The accuracy figure looks familiar. Apparently we get an accuracy at 0.87 when using that classifier on \"unseen\" data which in relation to the other results means that there is an overlap between training and evaluation data in the data that was used to train that classifier.\n",
"\n",
"The true results for perceiving classification with 2100 training examples and 900 evaluation examples is an accuracy of 0.559.\n",
"\n",
"I still haven't been able to isolate the error in the previous notebook, however. It has probably something to do with the sampling."
]
},
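{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to isolate the suspected sampling error would be to run the same origIx overlap check against the previous notebook's train and eval files (a sketch; the file names are hypothetical placeholders):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: check the previous notebook's datasets for train/eval overlap.\n",
"# NOTE: \"old_training.csv\" and \"old_evaluation.csv\" are hypothetical placeholders, not real files.\n",
"old_train = pd.read_csv(\"old_training.csv\", sep=\";\")\n",
"old_eval = pd.read_csv(\"old_evaluation.csv\", sep=\";\")\n",
"overlap = set(old_train.origIx) & set(old_eval.origIx)\n",
"print(\"Overlapping examples: {}\".format(len(overlap)))"
]
},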
{
"cell_type": "raw",
"metadata": {},
"source": [
"5. Om det fortfarande är bra accuracy, läs in external från fil och utvärdera"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda env:memeticscience]",
"language": "python",
"name": "conda-env-memeticscience-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@mattiasostmar (Author):

Removed the documentation of extra checks from the code to make it cleaner to the eye. See version 1 to follow my train of thought.

@mattiasostmar (Author):

`import time` is needed to handle connection errors when calling uClassify.
