{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Keyword oversampling example\n",
"In this example, I'll demonstrate how keyword oversampling can help reduce the sample size required for content coding when you're trying to establish inter-rater reliability for a relatively rare category in a set of text documents. We'll use the NLTK movie review corpus, which provides a set of positive and negative movie reviews."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"import random\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.metrics import cohen_kappa_score\n",
"from pewanalytics.stats.irr import kappa_sample_size_CI\n",
"from pewanalytics.stats.sampling import compute_sample_weights_from_frame"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package movie_reviews to\n",
"[nltk_data] /home/pvankessel/nltk_data...\n",
"[nltk_data] Package movie_reviews is already up-to-date!\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nltk.download(\"movie_reviews\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2000\n"
]
}
],
"source": [
"# Extract the text and category of each movie review in the corpus\n",
"rows = []\n",
"for category in nltk.corpus.movie_reviews.categories():\n",
" for fileid in nltk.corpus.movie_reviews.fileids(category):\n",
" rows.append({\n",
" \"text\": nltk.corpus.movie_reviews.raw(fileid),\n",
" \"category\": category\n",
" })\n",
"df = pd.DataFrame(rows)\n",
"print(len(df))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Set a random seed for replication\n",
"random_seed = 42\n",
"random.seed(random_seed)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a dataset where a certain category of interest is relatively rare. We'll pull 50 positive reviews and 1000 negative ones, making the positive reviews artificially rare in our dataset. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df = pd.concat([\n",
" df[df['category']=='pos'].sample(50, random_state=random_seed),\n",
" df[df['category']=='neg'].sample(1000, random_state=random_seed)\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>category</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>482</th>\n",
" <td>even though i have the utmost respect for rich...</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>951</th>\n",
" <td>the film may be called mercury rising , but th...</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>and now the high-flying hong kong style of fil...</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>872</th>\n",
" <td>fact that charles bronson represents one of th...</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1139</th>\n",
" <td>one of the last entries in the long-running ca...</td>\n",
" <td>pos</td>\n",
" </tr>\n",
" <tr>\n",
" <th>696</th>\n",
" <td>fit for a ghoul's night out , fat girl stands ...</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>822</th>\n",
" <td>bats is this year's camp flick . \\nwith the wo...</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>745</th>\n",
" <td>conventional wisdom among collectibles retaile...</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>best remembered for his understated performanc...</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>111</th>\n",
" <td>note : some may consider portions of the follo...</td>\n",
" <td>neg</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text category\n",
"482 even though i have the utmost respect for rich... neg\n",
"951 the film may be called mercury rising , but th... neg\n",
"12 and now the high-flying hong kong style of fil... neg\n",
"872 fact that charles bronson represents one of th... neg\n",
"1139 one of the last entries in the long-running ca... pos\n",
"696 fit for a ghoul's night out , fat girl stands ... neg\n",
"822 bats is this year's camp flick . \\nwith the wo... neg\n",
"745 conventional wisdom among collectibles retaile... neg\n",
"10 best remembered for his understated performanc... neg\n",
"111 note : some may consider portions of the follo... neg"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(10, random_state=random_seed)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we're going to define a function that simulates a content-coding exercise where two coders are tasked with flagging positive reviews. NLTK's existing \"positive\" labels can stand in as our first \"coder,\" and we'll create a second \"coder\" that disagrees with our first coder 1% of the time. Their agreement will be almost perfect - but our job is to confirm that as efficiently as we can."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def create_fake_coders(sample):\n",
" # Define coder 1 using NLTK's labels\n",
" sample['coder1'] = (sample['category'] == \"pos\").astype(int)\n",
" # Create coder 2\n",
" sample['coder2'] = sample['coder1']\n",
" # Select a random sample of 1% of the dataframe\n",
" replace_index = sample.sample(frac=.01, random_state=random_seed).index\n",
" # And invert coder 2's responses for those rows\n",
" sample.loc[replace_index, \"coder2\"] = sample[sample.index.isin(replace_index)][\"coder1\"].map(lambda x: abs(x-1))\n",
" return sample"
]
},
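{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, we can apply this function to a copy of the full dataset and confirm that our two simulated coders disagree on roughly 1% of rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: the simulated coders should disagree ~1% of the time\n",
"check = create_fake_coders(df.copy())\n",
"print((check['coder1'] != check['coder2']).mean())"
]
},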
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random sample"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than waste resources double-coding our entire population, it's usually standard practice to draw a sample to determine that our coders agree with each other. If we're successful, we can then divvy up the remaining cases between the two of them. Let's try a random sample of 100 documents."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"sample = df.sample(100, random_state=random_seed)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Let's have our coders \"code\" the sample\n",
"sample = create_fake_coders(sample)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8520710059171598"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kappa = cohen_kappa_score(sample['coder1'], sample['coder2'])\n",
"kappa"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The results from our sample look promising: 0.7 is usually a decent threshold for acceptable agreement, and our coders managed to exceed that with a Cohen's Kappa of 0.85 in our random sample. However, it looks like positive cases were really rare in this sample - and this poses a problem."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.035"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean = np.average([sample['coder1'].mean(), sample['coder2'].mean()])\n",
"mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Despite the promising results, there are so few positive cases in our sample (3.5%) that we don't have the statistical power to confirm (with 95% confidence) that the coders would _actually_ agree with each other at a rate at or above our minimum threshold of 0.7 Kappa were they to code our whole dataset. If we take the preliminary Kappa that we're observing in our random sample and assume that A) the coders will continue to perform similarly on additional cases, and B) that positive cases will continue to appear at roughly the same rate in additional random samples, it looks like we'd need to code a total of more than 400 documents to establish that our coders have acceptable agreement:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"431"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"random_sample_size = kappa_sample_size_CI(kappa, .7, mean, alpha=.05)\n",
"random_sample_size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Keyword oversample"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But what if we could boost the rate of positive cases in our sample from 3.5% to, say, 25%? If that were the case, and our coders managed to achieve the same level of agreement they had in our random sample (0.85), then we'd only have to code 80 documents:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"80"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kw_sample_size = kappa_sample_size_CI(kappa, .7, .25)\n",
"kw_sample_size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is where keyword oversampling can help. First, we simply need to brainstorm some words that we think might be indicative of the rare positive cases we're trying to find. The list doesn't have to be perfect - even a short list of terms can give us a nice boost:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"OVERSAMPLING_KEYWORDS = [\n",
" \"flawless\", \"outstanding\", \"imaginative\", \"breathtaking\", \"chilling\" , \"marvelous\", \"superb\", \"refreshing\", \n",
" \"terrific\", \"beautiful\", \"poignant\", \"excellent\", \"exceptional\"\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using these keywords, let's pull a new sample. This time, instead of drawing cases randomly, we'll oversample on documents that match to these keywords in an attempt to pull in more positive cases, boosting their prevalence and hopefully reducing the sample size required to confirm our agreement with confidence. First, let's flag documents that match to these keywords:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"df['sampling_flag'] = df['text'].str.contains(r\"|\".join(OVERSAMPLING_KEYWORDS)).astype(int)"
]
},
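{
"cell_type": "markdown",
"metadata": {},
"source": [
"A couple of notes on this flag: `str.contains` performs a case-sensitive regex match, which is fine here because the NLTK reviews are already lower-cased, but with mixed-case text you might pass `case=False` (and consider adding word boundaries). Before sampling, it's also worth peeking at how many documents the keywords flag - and, since we happen to know the true labels in this toy example, how much the flag enriches for positive cases:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# What share of documents match at least one keyword?\n",
"print(df['sampling_flag'].mean())\n",
"# In a real project we wouldn't know the true labels, but in this toy\n",
"# example we can peek at how the flag lines up with them\n",
"print(df.groupby('sampling_flag')['category'].value_counts(normalize=True))"
]
},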
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's build a sample of 80 documents where half of the documents match to these keywords and half don't. Again, we're not looking for perfection here - it's unrealistic to hope that this short list of keywords will result in a perfect 50/50 sample of positive vs. negative cases - but if we wind up with something close to a 25/75 split, 80 cases should be enough to confirm that our coders are doing a good job. That would save us a lot of time."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"half_sample = int(float(kw_sample_size)/2.0)\n",
"sample = pd.concat([\n",
" df[df['sampling_flag']==1].sample(half_sample, random_state=random_seed),\n",
" df[df['sampling_flag']==0].sample(half_sample, random_state=random_seed),\n",
"])"
]
},
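{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check that the sample looks the way we intended - 80 documents, half of them keyword matches:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 80 documents total, with a 50/50 split on the sampling flag\n",
"print(len(sample), sample['sampling_flag'].mean())"
]
},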
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we're overrepresenting certain documents in our sample, it's important that we account for that with sampling weights (propensity weights), which we can calculate using our `sampling_flag` variable. This will re-adjust our sample to be representative of the population. Documents containing our keywords will be weighted down - but hopefully the increase in positive cases will offset that penalty:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"sample['sampling_weight'] = compute_sample_weights_from_frame(df, sample, ['sampling_flag'])"
]
},
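{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build some intuition for what `compute_sample_weights_from_frame` is doing, we can sketch the weights ourselves. For a single stratification variable like `sampling_flag`, each document's weight should work out to its stratum's share of the population divided by its stratum's share of the sample (a rough illustration of the idea, not the library's exact implementation):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check: each stratum's weight should be roughly its\n",
"# population share divided by its sample share, so keyword-matched\n",
"# documents get weighted down and unmatched ones get weighted up\n",
"pop_props = df['sampling_flag'].value_counts(normalize=True)\n",
"samp_props = sample['sampling_flag'].value_counts(normalize=True)\n",
"print(pop_props / samp_props)\n",
"print(sample.groupby('sampling_flag')['sampling_weight'].mean())"
]
},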
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# Let's have our coders \"code\" this new sample\n",
"sample = create_fake_coders(sample)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fortunately, we get a good kappa again, even after down-weighting documents that contain the keywords we oversampled on:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9543736729415849"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kappa = cohen_kappa_score(sample['coder1'], sample['coder2'], sample_weight=sample['sampling_weight'])\n",
"kappa"
]
},
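{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, we can also compute the unweighted kappa on the same sample; it can differ from the weighted version, since the weights make disagreements on overrepresented (keyword-matched) documents count for less:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# For comparison: kappa on the oversample, ignoring the sampling weights\n",
"cohen_kappa_score(sample['coder1'], sample['coder2'])"
]
},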
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As expected, our keywords helped pull in more positive cases, which comprise 11.25% of our new keyword oversample. That's not quite the 25% we were shooting for, but it's sure better than the 3.5% that we observed in our random sample!"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.1125"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean = min([sample['coder1'].mean(), sample['coder2'].mean()])\n",
"mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if this gave us enough of a boost that we can confirm a Kappa above 0.7. Using our new observed Kappa of 0.95 and the new prevalence of 11.25% (compared with 0.85 and 3.5% in our random sample), it looks like a sample of 80 is more than enough:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"54"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kappa_sample_size_CI(kappa, .7, mean)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hooray! With keyword oversampling, we were able to confirm that our coders have acceptable reliablity using a sample of just 80 documents, instead of the 400+ documents we would have had to code if we had just kept sampling randomly."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}