{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 94-775/95-865: Co-Occurrence Analysis for Finding Possible Relationships\n",
"\n",
"Author: George H. Chen (georgechen [at symbol] cmu.edu)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We begin by importing `numpy` and telling it to display numbers to 5 decimal places and to print tiny numbers as plain decimals rather than in scientific notation:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"np.set_printoptions(precision=5, suppress=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculating joint and marginal probability tables given a co-occurrence table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We work off the following co-occurrence table:\n",
"\n",
"\n",
"| <i></i> | Apple | Facebook | Tesla |\n",
"| --------------- |:-----:|:--------:|:-----:|\n",
"| Elon Musk | 10 | 15 | 300 |\n",
"| Mark Zuckerberg | 500 | 10000 | 500 |\n",
"| Tim Cook | 200 | 30 | 10 |\n",
"\n",
"\n",
"So Elon Musk and Apple co-occur in 10 news articles, Elon Musk and Facebook co-occur in 15 news articles, etc."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 10 15 300]\n",
" [ 500 10000 500]\n",
" [ 200 30 10]]\n"
]
}
],
"source": [
"co_occurrence_table = np.array([[10, 15, 300],\n",
" [500, 10000, 500],\n",
" [200, 30, 10]])\n",
"print(co_occurrence_table)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The joint probability table can be obtained by dividing every entry of the co-occurrence table by the total number of co-occurrences:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0.00086 0.0013 0.02594]\n",
" [0.04323 0.86468 0.04323]\n",
" [0.01729 0.00259 0.00086]]\n"
]
}
],
"source": [
"joint_prob_table = co_occurrence_table / co_occurrence_table.sum()\n",
"print(joint_prob_table)"
]
},
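{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick hand check of one entry (a sketch that uses only the counts in the co-occurrence table), the total number of co-occurrences is $N = 10 + 15 + 300 + 500 + 10000 + 500 + 200 + 30 + 10 = 11565$, so\n",
"$$P(\\text{Elon Musk}, \\text{Tesla}) = \\frac{300}{11565} \\approx 0.02594,$$\n",
"which agrees with the top-right entry of the printed joint probability table."
]
},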
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get the marginal probabilities **P(Elon Musk)**, **P(Mark Zuckerberg)**, and **P(Tim Cook)**, we sum each row of the joint probability table over its columns. In `numpy`, axis 0 indexes rows and axis 1 indexes columns, so summing with `axis=1` collapses the columns and leaves one total per row (one per person):"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.0281 0.95115 0.02075]\n"
]
}
],
"source": [
"people_prob = joint_prob_table.sum(axis=1)\n",
"print(people_prob)"
]
},
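{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see where the first of these numbers comes from, here is the sum written out by hand for Elon Musk (a sketch using the raw counts, with $N = 11565$ total co-occurrences):\n",
"$$P(\\text{Elon Musk}) = \\frac{10 + 15 + 300}{11565} = \\frac{325}{11565} \\approx 0.02810.$$\n",
"Summing the three rounded joint probabilities in the first row, $0.00086 + 0.0013 + 0.02594$, gives the same value."
]
},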
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get the marginal probabilities **P(Apple)**, **P(Facebook)**, and **P(Tesla)**, we sum each column of the joint probability table over its rows (`axis=0`), which leaves one total per column (one per company):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.06139 0.86857 0.07004]\n"
]
}
],
"source": [
"company_prob = joint_prob_table.sum(axis=0)\n",
"print(company_prob)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we compute what the joint probability table would be *if people and companies were independent*. We show two different approaches for doing this. The first is a straightforward calculation that uses the formula $P(A, B)=P(A)P(B)$ when $A$ and $B$ are independent:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0.00173 0.02441 0.00197]\n",
" [0.05839 0.82614 0.06662]\n",
" [0.00127 0.01802 0.00145]]\n"
]
}
],
"source": [
"joint_prob_table_if_people_and_companies_are_indep = np.zeros((len(people_prob), len(company_prob)))\n",
"for row_idx in range(len(people_prob)):\n",
"    for col_idx in range(len(company_prob)):\n",
"        joint_prob_table_if_people_and_companies_are_indep[row_idx, col_idx] = people_prob[row_idx] * company_prob[col_idx]\n",
"print(joint_prob_table_if_people_and_companies_are_indep)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A more concise approach, for those who know linear algebra, is to recognize that this table is the outer product of the marginal probabilities for people and the marginal probabilities for companies:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0.00173 0.02441 0.00197]\n",
" [0.05839 0.82614 0.06662]\n",
" [0.00127 0.01802 0.00145]]\n"
]
}
],
"source": [
"joint_prob_table_if_people_and_companies_are_indep = np.outer(people_prob, company_prob)\n",
"print(joint_prob_table_if_people_and_companies_are_indep)"
]
},
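{
"cell_type": "markdown",
"metadata": {},
"source": [
"To spell out why the outer product gives the same table: writing $u$ for the vector of people marginals and $v$ for the vector of company marginals (names introduced here only for illustration), the outer product has entries\n",
"$$(u v^\\top)_{ij} = u_i v_j,$$\n",
"which is the same product the double loop computes. For instance, the top-right entry is\n",
"$$P(\\text{Elon Musk})\\,P(\\text{Tesla}) \\approx 0.02810 \\times 0.07004 \\approx 0.00197.$$"
]
},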
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing pointwise mutual information (PMI)\n",
"\n",
"Next, we compute the PMI between each person and each company. The formula for PMI is $$PMI(A,B)=\\log\\frac{P(A,B)}{P(A)P(B)},$$ where the choice of logarithm base does not affect how the pairs rank, since changing the base only rescales every PMI value by the same positive constant (we will use log base 2)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[-0.99657 -4.23412 3.72022]\n",
" [-0.43363 0.06578 -0.62373]\n",
" [ 3.76277 -2.79671 -0.74926]]\n"
]
}
],
"source": [
"PMI = np.log2(joint_prob_table /\n",
" joint_prob_table_if_people_and_companies_are_indep)\n",
"print(PMI)"
]
},
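{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hand check of one entry (a sketch using the raw counts, with $N = 11565$), take Elon Musk and Tesla. The marginals are $P(\\text{Elon Musk}) = 325/11565$ and $P(\\text{Tesla}) = (300 + 500 + 10)/11565 = 810/11565$, so\n",
"$$PMI(\\text{Elon Musk}, \\text{Tesla}) = \\log_2\\frac{300/11565}{(325/11565)(810/11565)} = \\log_2\\frac{300 \\cdot 11565}{325 \\cdot 810} \\approx \\log_2 13.18 \\approx 3.72.$$\n",
"The pair co-occurs roughly 13 times more often than it would if people and companies were independent, which is why its PMI is large and positive."
]
},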
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ranking the 9 PMIs from largest to smallest, the 3 largest are 3.76277 (Tim Cook and Apple), 3.72022 (Elon Musk and Tesla), and 0.06578 (Mark Zuckerberg and Facebook), which recover the CEO/company pairings."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing phi-square and chi-square metrics that tell us how far people and companies are from being independent\n",
"\n",
"PMI compares $P(A,B)$ and $P(A)P(B)$ by looking at the log of their ratio. Phi-square instead looks at\n",
"$$\\sum_{A,B} \\frac{(P(A,B)-P(A)P(B))^2}{P(A)P(B)}.$$\n",
"\n",
"If $N$ is the total number of co-occurrences in the original co-occurrence table, then chi-square is $N$ multiplied by the phi-square value."
]
},
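{
"cell_type": "markdown",
"metadata": {},
"source": [
"It may help to see why multiplying by $N$ is natural. Writing $O_{A,B} = N\\,P(A,B)$ for the observed count and $E_{A,B} = N\\,P(A)P(B)$ for the count expected under independence (both symbols are notation introduced here, not part of the original text), we get\n",
"$$\\chi^2 = N\\sum_{A,B} \\frac{(P(A,B)-P(A)P(B))^2}{P(A)P(B)} = \\sum_{A,B} \\frac{(N\\,P(A,B)-N\\,P(A)P(B))^2}{N\\,P(A)P(B)} = \\sum_{A,B} \\frac{(O_{A,B}-E_{A,B})^2}{E_{A,B}}.$$\n",
"So chi-square compares observed counts against expected counts, while phi-square makes the same comparison on probabilities."
]
},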
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"numer = (joint_prob_table - joint_prob_table_if_people_and_companies_are_indep)**2"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"denom = joint_prob_table_if_people_and_companies_are_indep"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5430990706343366\n"
]
}
],
"source": [
"phi_square = (numer / denom).sum()\n",
"print(phi_square)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"11565\n"
]
}
],
"source": [
"N = co_occurrence_table.sum()\n",
"print(N)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"6280.940751886103\n"
]
}
],
"source": [
"chi_square = N * phi_square\n",
"print(chi_square)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}