georgehc/94-775, 95-865 Co-Occurrence Analysis for Finding Possible Relationships.ipynb Secret

## 94-775, 95-865 Co-Occurrence Analysis for Finding Possible Relationships.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 94-775/95-865: Co-Occurrence Analysis for Finding Possible Relationships\n",
    "\n",
    "Author: George H. Chen (georgechen [at symbol] cmu.edu)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We begin by importing `numpy`, and telling it to displays numbers to 5 decimal places and suppress printing out tiny numbers in scientific notation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "np.set_printoptions(precision=5, suppress=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculating joint and marginal probability tables given a co-occurrence table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We work off the following co-occurrence table:\n",
    "\n",
    "\n",
    "| <i></i>         | Apple | Facebook | Tesla |\n",
    "| --------------- |:-----:|:--------:|:-----:|\n",
    "| Elon Musk       | 10    | 15       | 300   |\n",
    "| Mark Zuckerberg | 500   | 10000    | 500   |\n",
    "| Tim Cook        | 200   | 30       | 10    |\n",
    "\n",
    "\n",
    "So Elon Musk and Apple co-occur in 10 news articles, Elon Musk and Facebook co-occur in 15 news articles, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[   10    15   300]\n",
      " [  500 10000   500]\n",
      " [  200    30    10]]\n"
     ]
    }
   ],
   "source": [
    "co_occurrence_table = np.array([[10, 15, 300],\n",
    "                                [500, 10000, 500],\n",
    "                                [200, 30, 10]])\n",
    "print(co_occurrence_table)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The joint probability table can be obtained by dividing every entry of the co-occurrence table by the total number of co-occurrences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.00086 0.0013  0.02594]\n",
      " [0.04323 0.86468 0.04323]\n",
      " [0.01729 0.00259 0.00086]]\n"
     ]
    }
   ],
   "source": [
    "joint_prob_table = co_occurrence_table / co_occurrence_table.sum()\n",
    "print(joint_prob_table)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get the marginal probabilities **P(Elon Musk)**, **P(Mark Zuckerberg)**, and **P(Tim Cook)**, we sum the joint probability table across columns. In `numpy`, axis 0 corresponds to rows, and axis 1 corresponds to columns. To sum across columns, we do the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.0281  0.95115 0.02075]\n"
     ]
    }
   ],
   "source": [
    "people_prob = joint_prob_table.sum(axis=1)\n",
    "print(people_prob)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get the marginal probabilities **P(Apple)**, **P(Facebook)**, and **P(Tesla)**, we sum across rows of the joint probability table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.06139 0.86857 0.07004]\n"
     ]
    }
   ],
   "source": [
    "company_prob = joint_prob_table.sum(axis=0)\n",
    "print(company_prob)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we compute what the joint probability table would be *if people and companies were independent*. We show two different approaches for doing this. The first is a straightforward calculation that uses the formula $P(A, B)=P(A)P(B)$ when $A$ and $B$ are independent:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.00173 0.02441 0.00197]\n",
      " [0.05839 0.82614 0.06662]\n",
      " [0.00127 0.01802 0.00145]]\n"
     ]
    }
   ],
   "source": [
    "joint_prob_table_if_people_and_companies_are_indep = np.zeros((len(people_prob), len(company_prob)))\n",
    "for row_idx in range(len(people_prob)):\n",
    "    for col_idx in range(len(company_prob)):\n",
    "        joint_prob_table_if_people_and_companies_are_indep[row_idx, col_idx] = people_prob[row_idx] * company_prob[col_idx]\n",
    "print(joint_prob_table_if_people_and_companies_are_indep)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The more elegant, slicker approach for those who know linear algebra is to recognize that we just need to take the outer product between the marginal probabilities for people and the marginal probabilities for companies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.00173 0.02441 0.00197]\n",
      " [0.05839 0.82614 0.06662]\n",
      " [0.00127 0.01802 0.00145]]\n"
     ]
    }
   ],
   "source": [
    "joint_prob_table_if_people_and_companies_are_indep = np.outer(people_prob, company_prob)\n",
    "print(joint_prob_table_if_people_and_companies_are_indep)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing pointwise mutual information (PMI)\n",
    "\n",
    "Next, we compute the PMI between each person and each company. The formula for PMI is $$PMI(A,B)=\\log\\frac{P(A,B)}{P(A)P(B)},$$where the base of the logarithm is not actually important (we will use log base 2)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[-0.99657 -4.23412  3.72022]\n",
      " [-0.43363  0.06578 -0.62373]\n",
      " [ 3.76277 -2.79671 -0.74926]]\n"
     ]
    }
   ],
   "source": [
    "PMI = np.log2(joint_prob_table /\n",
    "              joint_prob_table_if_people_and_companies_are_indep)\n",
    "print(PMI)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By ranking the 9 PMIs from largest to smallest, and looking at the largest 3 PMIs (3.76277, 3.72022, and 0.06578), we see that these tell us the CEO/company pairings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing phi-square and chi-square metrics that tell us how far people and companies are from being independent\n",
    "\n",
    "PMI compares $P(A,B)$ and $P(A)P(B)$ by looking at the log of their ratio. Phi-square instead looks at\n",
    "$$\\sum_{A,B} \\frac{(P(A,B)-P(A)P(B))^2}{P(A)P(B)}.$$\n",
    "\n",
    "If $N$ is the total number of co-occurrences in the original co-occurrence table, then chi-square is $N$ multiplied by the phi-square value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "numer = (joint_prob_table - joint_prob_table_if_people_and_companies_are_indep)**2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "denom = joint_prob_table_if_people_and_companies_are_indep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.5430990706343366\n"
     ]
    }
   ],
   "source": [
    "phi_square = (numer / denom).sum()\n",
    "print(phi_square)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "11565\n"
     ]
    }
   ],
   "source": [
    "N = co_occurrence_table.sum()\n",
    "print(N)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "6280.940751886103\n"
     ]
    }
   ],
   "source": [
    "chi_square = N * phi_square\n",
    "print(chi_square)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# 94-775/95-865: Co-Occurrence Analysis for Finding Possible Relationships\n",
	"\n",
	"Author: George H. Chen (georgechen [at symbol] cmu.edu)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We begin by importing `numpy`, and telling it to displays numbers to 5 decimal places and suppress printing out tiny numbers in scientific notation:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [],
	"source": [
	"import numpy as np\n",
	"np.set_printoptions(precision=5, suppress=True)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Calculating joint and marginal probability tables given a co-occurrence table"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We work off the following co-occurrence table:\n",
	"\n",
	"\n",
	"\| <i></i> \| Apple \| Facebook \| Tesla \|\n",
	"\| --------------- \|:-----:\|:--------:\|:-----:\|\n",
	"\| Elon Musk \| 10 \| 15 \| 300 \|\n",
	"\| Mark Zuckerberg \| 500 \| 10000 \| 500 \|\n",
	"\| Tim Cook \| 200 \| 30 \| 10 \|\n",
	"\n",
	"\n",
	"So Elon Musk and Apple co-occur in 10 news articles, Elon Musk and Facebook co-occur in 15 news articles, etc."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"[[ 10 15 300]\n",
	" [ 500 10000 500]\n",
	" [ 200 30 10]]\n"
	]
	}
	],
	"source": [
	"co_occurrence_table = np.array([[10, 15, 300],\n",
	" [500, 10000, 500],\n",
	" [200, 30, 10]])\n",
	"print(co_occurrence_table)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The joint probability table can be obtained by dividing every entry of the co-occurrence table by the total number of co-occurrences:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"[[0.00086 0.0013 0.02594]\n",
	" [0.04323 0.86468 0.04323]\n",
	" [0.01729 0.00259 0.00086]]\n"
	]
	}
	],
	"source": [
	"joint_prob_table = co_occurrence_table / co_occurrence_table.sum()\n",
	"print(joint_prob_table)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"To get the marginal probabilities P(Elon Musk), P(Mark Zuckerberg), and P(Tim Cook), we sum the joint probability table across columns. In `numpy`, axis 0 corresponds to rows, and axis 1 corresponds to columns. To sum across columns, we do the following:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"[0.0281 0.95115 0.02075]\n"
	]
	}
	],
	"source": [
	"people_prob = joint_prob_table.sum(axis=1)\n",
	"print(people_prob)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"To get the marginal probabilities P(Apple), P(Facebook), and P(Tesla), we sum across rows of the joint probability table:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"[0.06139 0.86857 0.07004]\n"
	]
	}
	],
	"source": [
	"company_prob = joint_prob_table.sum(axis=0)\n",
	"print(company_prob)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Next, we compute what the joint probability table would be if people and companies were independent. We show two different approaches for doing this. The first is a straightforward calculation that uses the formula $P(A, B)=P(A)P(B)$ when $A$ and $B$ are independent:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"[[0.00173 0.02441 0.00197]\n",
	" [0.05839 0.82614 0.06662]\n",
	" [0.00127 0.01802 0.00145]]\n"
	]
	}
	],
	"source": [
	"joint_prob_table_if_people_and_companies_are_indep = np.zeros((len(people_prob), len(company_prob)))\n",
	"for row_idx in range(len(people_prob)):\n",
	" for col_idx in range(len(company_prob)):\n",
	" joint_prob_table_if_people_and_companies_are_indep[row_idx, col_idx] = people_prob[row_idx] * company_prob[col_idx]\n",
	"print(joint_prob_table_if_people_and_companies_are_indep)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The more elegant, slicker approach for those who know linear algebra is to recognize that we just need to take the outer product between the marginal probabilities for people and the marginal probabilities for companies:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"[[0.00173 0.02441 0.00197]\n",
	" [0.05839 0.82614 0.06662]\n",
	" [0.00127 0.01802 0.00145]]\n"
	]
	}
	],
	"source": [
	"joint_prob_table_if_people_and_companies_are_indep = np.outer(people_prob, company_prob)\n",
	"print(joint_prob_table_if_people_and_companies_are_indep)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Computing pointwise mutual information (PMI)\n",
	"\n",
	"Next, we compute the PMI between each person and each company. The formula for PMI is $$PMI(A,B)=\\log\\frac{P(A,B)}{P(A)P(B)},$$where the base of the logarithm is not actually important (we will use log base 2)."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"[[-0.99657 -4.23412 3.72022]\n",
	" [-0.43363 0.06578 -0.62373]\n",
	" [ 3.76277 -2.79671 -0.74926]]\n"
	]
	}
	],
	"source": [
	"PMI = np.log2(joint_prob_table /\n",
	" joint_prob_table_if_people_and_companies_are_indep)\n",
	"print(PMI)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"By ranking the 9 PMIs from largest to smallest, and looking at the largest 3 PMIs (3.76277, 3.72022, and 0.06578), we see that these tell us the CEO/company pairings."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Computing phi-square and chi-square metrics that tell us how far people and companies are from being independent\n",
	"\n",
	"PMI compares $P(A,B)$ and $P(A)P(B)$ by looking at the log of their ratio. Phi-square instead looks at\n",
	"$$\\sum_{A,B} \\frac{(P(A,B)-P(A)P(B))^2}{P(A)P(B)}.$$\n",
	"\n",
	"If $N$ is the total number of co-occurrences in the original co-occurrence table, then chi-square is $N$ multiplied by the phi-square value."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {},
	"outputs": [],
	"source": [
	"numer = (joint_prob_table - joint_prob_table_if_people_and_companies_are_indep)**2"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 10,
	"metadata": {},
	"outputs": [],
	"source": [
	"denom = joint_prob_table_if_people_and_companies_are_indep"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 11,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"0.5430990706343366\n"
	]
	}
	],
	"source": [
	"phi_square = (numer / denom).sum()\n",
	"print(phi_square)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 12,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"11565\n"
	]
	}
	],
	"source": [
	"N = co_occurrence_table.sum()\n",
	"print(N)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 13,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"6280.940751886103\n"
	]
	}
	],
	"source": [
	"chi_square = N * phi_square\n",
	"print(chi_square)"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.5.4"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}