{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 94-775/95-865: Co-Occurrence Analysis for Finding Possible Relationships\n",
"\n",
"Author: George H. Chen (georgechen [at symbol] cmu.edu)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We begin by importing `numpy`, and telling it to displays numbers to 5 decimal places and suppress printing out tiny numbers in scientific notation:" | |
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"np.set_printoptions(precision=5, suppress=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculating joint and marginal probability tables given a co-occurrence table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We work off the following co-occurrence table:\n",
"\n",
"| <i></i> | Alphabet | AMD | Tesla |\n",
"| ------------- |:--------:|:----:|:-----:|\n",
"| Elon Musk | 1500 | 1000 | 20000 |\n",
"| Sundar Pichai | 1000 | 50 | 50 |\n",
"| Lisa Su | 30 | 700 | 10 |\n",
"\n",
"So Elon Musk and Alphabet co-occur in 1500 news articles, Elon Musk and AMD co-occur in 1000 news articles, etc."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 1500 1000 20000]\n",
" [ 1000 50 50]\n",
" [ 30 700 10]]\n"
]
}
],
"source": [
"co_occurrence_table = np.array([[1500, 1000, 20000],\n",
"                                [1000, 50, 50],\n",
"                                [30, 700, 10]])\n",
"print(co_occurrence_table)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3, 3)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"co_occurrence_table.shape"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"num_people, num_companies = co_occurrence_table.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The joint probability table can be obtained by dividing every entry of the co-occurrence table by the total number of co-occurrences:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"24340"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"co_occurrence_table.sum()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0.06163 0.04108 0.82169]\n",
" [0.04108 0.00205 0.00205]\n",
" [0.00123 0.02876 0.00041]]\n"
]
}
],
"source": [
"joint_prob_table = co_occurrence_table / co_occurrence_table.sum()\n",
"print(joint_prob_table)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get the marginal probabilities **P(Elon Musk)**, **P(Sundar Pichai)**, and **P(Lisa Su)**, we sum the joint probability table across columns. In `numpy`, axis 0 corresponds to rows, and axis 1 corresponds to columns. To sum across columns, we do the following:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.9244 0.04519 0.0304 ]\n"
]
}
],
"source": [
"people_prob = joint_prob_table.sum(axis=1)\n",
"print(people_prob)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3,)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"people_prob.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get the marginal probabilities **P(Alphabet)**, **P(AMD)**, and **P(Tesla)**, we sum across rows of the joint probability table:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.10394 0.0719 0.82416]\n"
]
}
],
"source": [
"company_prob = joint_prob_table.sum(axis=0)\n",
"print(company_prob)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we compute what the joint probability table would be *if people and companies were independent*. We show two different approaches for doing this. The first is a straightforward calculation that uses the formula $P(A, B)=P(A)P(B)$ when $A$ and $B$ are independent:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0.09609 0.06646 0.76185]\n",
" [0.0047 0.00325 0.03725]\n",
" [0.00316 0.00219 0.02506]]\n"
]
}
],
"source": [
"joint_prob_table_if_people_and_companies_were_indep = np.zeros((num_people, num_companies))\n",
"for row_idx in range(num_people):\n",
"    for col_idx in range(num_companies):\n",
"        joint_prob_table_if_people_and_companies_were_indep[row_idx, col_idx] = people_prob[row_idx] * company_prob[col_idx]\n",
"print(joint_prob_table_if_people_and_companies_were_indep)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The more elegant, slicker approach for those who know linear algebra is to recognize that we just need to take the outer product between the marginal probabilities for people and the marginal probabilities for companies:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0.09609 0.06646 0.76185]\n",
" [0.0047 0.00325 0.03725]\n",
" [0.00316 0.00219 0.02506]]\n"
]
}
],
"source": [
"joint_prob_table_if_people_and_companies_were_indep = np.outer(people_prob, company_prob)\n",
"print(joint_prob_table_if_people_and_companies_were_indep)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing pointwise mutual information (PMI)\n",
"\n",
"Next, we compute the PMI between each person and each company. The formula for PMI is $$PMI(A,B)=\\log\\frac{P(A,B)}{P(A)P(B)},$$ where the base of the logarithm is not actually important (we will use log base 2)."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[-0.64077 -0.69395 0.10908]\n",
" [ 3.12862 -0.66153 -4.18042]\n",
" [-1.35837 3.71773 -5.93045]]\n"
]
}
],
"source": [
"PMI = np.log2(joint_prob_table / joint_prob_table_if_people_and_companies_were_indep)\n",
"print(PMI)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By ranking the 9 PMIs from largest to smallest, and looking at the largest 3 PMIs (3.76277, 3.72022, and 0.06578), we see that these tell us the person/company pairings." | |
]
},
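{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch of that ranking. The `people` and `companies` label lists are assumptions added for illustration; they just mirror the row/column order of the co-occurrence table above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical labels matching the table's row/column order\n",
"people = ['Elon Musk', 'Sundar Pichai', 'Lisa Su']\n",
"companies = ['Alphabet', 'AMD', 'Tesla']\n",
"# argsort on the negated, flattened PMI table gives indices in decreasing PMI order\n",
"for flat_idx in np.argsort(-PMI, axis=None):\n",
"    row_idx, col_idx = np.unravel_index(flat_idx, PMI.shape)\n",
"    print('%8.5f' % PMI[row_idx, col_idx], '-', people[row_idx], '&', companies[col_idx])"
]
},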
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing phi-squared, chi-squared, and Cramér's V metrics that tell us how far people and companies are from being independent\n",
"\n",
"PMI compares $P(A,B)$ and $P(A)P(B)$ by looking at the log of their ratio. Phi-squared instead looks at\n",
"$$\\sum_{A,B} \\frac{(P(A,B)-P(A)P(B))^2}{P(A)P(B)}.$$\n",
"\n",
"If $N$ is the total number of co-occurrences in the original co-occurrence table, then chi-squared is $N$ multiplied by the phi-squared value."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"numer = (joint_prob_table - joint_prob_table_if_people_and_companies_were_indep)**2"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0.00119 0.00064 0.00358]\n",
" [0.00132 0. 0.00124]\n",
" [0. 0.00071 0.00061]]\n"
]
}
],
"source": [
"print(numer)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"denom = joint_prob_table_if_people_and_companies_were_indep"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.01236, 0.00969, 0.0047 ],\n",
" [0.28185, 0.00044, 0.03325],\n",
" [0.00118, 0.32305, 0.02424]])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numer / denom"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.6907550159572748\n"
]
}
],
"source": [
"phi_squared = (numer / denom).sum()\n",
"print(phi_squared)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"24340\n"
]
}
],
"source": [
"N = co_occurrence_table.sum()\n",
"print(N)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"16812.977088400068\n"
]
}
],
"source": [
"chi_squared = N * phi_squared\n",
"print(chi_squared)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5876882744947677\n"
]
}
],
"source": [
"max_phi_squared_value = min(co_occurrence_table.shape)-1\n", | |
"cramers_V = np.sqrt(phi_squared / max_phi_squared_value)\n", | |
"print(cramers_V)" | |
] | |
}, | |
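{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (this assumes a `scipy` version of at least 1.7, which provides `scipy.stats.contingency.association`), we can compare our hand-computed Cramér's V against scipy's:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats.contingency import association\n",
"# scipy computes Cramér's V directly from the raw co-occurrence counts;\n",
"# this should agree with cramers_V from the previous cell\n",
"print(association(co_occurrence_table, method='cramer'))"
]
},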
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we collect co-occurrence tables for other pairings of categories (not just people vs companies), then we can rank these pairings based on their phi-squared, chi-squared, or Cramér's V values (it does not matter which of these you use to rank as they will yield the exact same ordering)." | |
]
},
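{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch of that point; the helper function and the two toy tables are made up for illustration. Both tables are 2x2 with total count 100, and all three metrics agree that table A is further from independence than table B:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def independence_metrics(table):\n",
"    # hypothetical helper: phi-squared, chi-squared, and Cramér's V for one co-occurrence table\n",
"    table = np.asarray(table, dtype=float)\n",
"    joint = table / table.sum()\n",
"    indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))\n",
"    phi_sq = ((joint - indep)**2 / indep).sum()\n",
"    chi_sq = table.sum() * phi_sq\n",
"    cramers_v = np.sqrt(phi_sq / (min(table.shape) - 1))\n",
"    return phi_sq, chi_sq, cramers_v\n",
"\n",
"# two toy tables with the same shape (2x2) and the same total count (100)\n",
"for name, table in [('A', [[40, 10], [10, 40]]), ('B', [[30, 20], [20, 30]])]:\n",
"    print(name, independence_metrics(table))"
]
},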
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Relating to Pearson's chi-squared statistical test\n",
"\n",
"This last part of the demo assumes that you know [Pearson's chi-squared test for independence](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test). This is effectively what we are doing: seeing if people and companies are independent (if they are, then in subsequent data analyses, we can focus on other relationships that could be more promising). However, for the chi-squared test to actually work as advertised, we need to check that the assumptions of the test hold (otherwise the p-value derived from the test is *not* valid!). Importantly, in practice, often we don't care about p-values and don't need to report them either. *In fact, we might only be interested in computing the phi-squared/chi-squared/Cramér's V values only for the purposes of ranking different pairings of named entity types, in which case the exact values don't really matter, just the ordering!*\n", | |
"\n", | |
"First off, it is important to understand that in unstructured data analysis, sometimes how we use statistical tools is going to be a bit of a hack, meaning that strictly speaking we are not using a tool like how it is supposed to be used. Don't panic. In all sorts of endeavors involving working with data, there are going to be hacks. That said, it is good to know when something we use is a hack and why!\n", | |
"\n", | |
"In computing the `chi_squared` variable from earlier, I mentioned that it is related to the chi-squared test. However, there is a catch: for it to actually correspond to a chi-squared test, we need the co-occurrence table to have a specific independence condition *that is not guaranteed to hold in how we constructed the co-occurrence table!*\n", | |
"\n", | |
"Specifically, recall that the co-occurrence table we used in lecture and in the demo is:\n", | |
"\n", | |
"| <i></i> | Alphabet | AMD | Tesla |\n", | |
"| ------------- |:--------:|:----:|:-----:|\n", | |
"| Elon Musk | 1500 | 1000 | 20000 |\n", | |
"| Sundar Pichai | 1000 | 50 | 50 |\n", | |
"| Lisa Su | 30 | 700 | 10 |\n", | |
"\n", | |
"Here, the numbers correspond to the number of articles that mention both a specific person and a specific company.\n", | |
"\n", | |
"In Pearson's chi-squared test, a key assumption is that of the 1500 articles counted that mention both Elon Musk and Alphabet, these 1500 articles are *not* also counted in any of the other table entries. For example, for the 1000 articles that mention both Elon Musk and AMD, these 1000 articles should not have any overlap with the 1500 articles that mention both Elon Musk and Alphabet! Basically, for any specific article, it can only be counted once across all the numbers in the table.\n", | |
"\n", | |
"How could we have constructed the co-occurrence table so that we are guaranteed that this key assumption holds? There's a simple fix: each time we look at an article, we look at all the co-occurrences it contains, and we arbitrarily choose one of them (for example, we can uniformly at random pick one of the pairings of person/company found) and that's the only co-occurrence that we actually add to the co-occurrence table for that specific article. This fix would ensure that each article is counted at most once in the co-occurrence table! (An article could be counted 0 times if it doesn't mention any person/company pairs.)\n", | |
"\n", | |
"For the chi-squared statistical test to make sense, we do need this key assumption to hold. A full list of assumptions needed for the chi-squared test can be found on the [chi-square test Wikipedia page](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test). **As a reminder, when not all the assumptions are met, then the p-value you obtain will not be statistically valid.** If we don't actually care about statistical significance/validity, we can still compute phi-squared/chi-squared and choose an arbitrary cutoff to decide on whether to flag people and companies as being an interesting pairing to analyze further, or to consider them to be an uninteresting pairing that does not warrant further analysis.\n", | |
"\n", | |
"With the disclaimer above, we now proceed with a chi-squared statistical test." | |
]
},
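{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before proceeding, here is a minimal sketch of the fix just described. The toy corpus and label lists below are made up for illustration; in a real pipeline the mentions would come from named entity recognition:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"# hypothetical toy corpus: each article is reduced to the people and companies it mentions\n",
"toy_articles = [\n",
"    {'people': ['Elon Musk'], 'companies': ['Tesla', 'Alphabet']},\n",
"    {'people': ['Sundar Pichai', 'Lisa Su'], 'companies': ['AMD']},\n",
"    {'people': [], 'companies': ['Tesla']},  # no person/company pair: contributes nothing\n",
"]\n",
"people = ['Elon Musk', 'Sundar Pichai', 'Lisa Su']\n",
"companies = ['Alphabet', 'AMD', 'Tesla']\n",
"\n",
"random.seed(0)  # for reproducibility\n",
"fixed_table = np.zeros((len(people), len(companies)), dtype=int)\n",
"for article in toy_articles:\n",
"    pairs = [(p, c) for p in article['people'] for c in article['companies']]\n",
"    if len(pairs) > 0:\n",
"        person, company = random.choice(pairs)  # count exactly one pair per article\n",
"        fixed_table[people.index(person), companies.index(company)] += 1\n",
"print(fixed_table)"
]
},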
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import chi2_contingency"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"chi_squared_test_statistic, p_val, _, expected_counts = chi2_contingency(N * joint_prob_table)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First off, we can check that the chi-square test statistic does match what we computed earlier (up to a very small numerical difference)."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"16812.97708840007"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chi_squared_test_statistic"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.637978807091713e-12"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"abs(chi_squared_test_statistic - chi_squared)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also look at the \"expected counts\" automatically computed by `chi2_contingency`; this is precisely the number of co-occurrences assuming that people and companies are independent:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 2338.74281, 1617.70748, 18543.54971],\n",
" [ 114.33854, 79.08792, 906.57354],\n",
" [ 76.91865, 53.2046 , 609.87675]])"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"expected_counts"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 2338.74281, 1617.70748, 18543.54971],\n",
" [ 114.33854, 79.08792, 906.57354],\n",
" [ 76.91865, 53.2046 , 609.87675]])"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"N * joint_prob_table_if_people_and_companies_were_indep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, we look at the p-value, noting that this is up to some sort of numerical precision:" | |
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p_val"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Basically the p-value is considered extremely small (so much so that it's just numerically evaluated to be 0.0) --- so if the assumptions of the chi-squared test are met, then we can declare that the people and companies are not independent." | |
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
} |