Clustering via Unsupervised Naive Bayes
{
"metadata": {
"name": "",
"signature": "sha256:10fecd6f42087a84f85522e829048cfc70d69141d2bde51e88daed4b8f1c9ddc"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>Introduction</h1>\n",
"<br>\n",
"I recently wanted to cluster Twitter users using unsupervised Naive Bayes. Unfortunately there is little help on how to do this on the internet. Thus I\u2019ve put together this tutorial.\n",
"<br><br>\n",
"<b>The goal:</b> Build a Naive Bayes Model to cluster Twitter users into two classes.\n",
"<br><br>\n",
"<b>The data:</b> A matrix of Twitter users with m rows and n columns. In this case we have m = 1780 rows and n = 500 columns. Each of the 1780 rows represents a user and each of the 500 columns represents a feature, i.e whether the user follows a specific top-500 Twitter account. Examples of these most-followed accounts include political figures, athletes, pop stars and collective accounts that cover a specific topic.\n",
"<br><br>\n",
"We will focus on the Twitter users here, but the clustering works for any data set that is organized in such an MxN matrix.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>A bunch of names and concepts:</h1>\n",
"<br>\n",
"There are a number of terms and variable names that will be used throughout. I decided to use really long explanatory names for the most important variables, and I will refer to the things that they denote by the variables names throughout the tutorial. E.g. instead of saying, then we update the probabilities that a randomly drawn user is from class 0 or 1, I will say then we update the <b>class_probabilities_log</b>.\n",
"<br><br>\n",
"I\u2019ll explain these here.\n",
"<br><br>\n",
"<b>data:</b> The records of Twitter users and which of the Top-500 accounts they follow.\n",
"<br><br>\n",
"<b>class_probabilities_log:</b> The probability that a randomly sampled user is in class 0 or 1. We initialize them as log(0.5), log(0.5). Thus we initially assume that both classes are similarly likely. Because there are two classes the size of this vector is 2.\n",
"<br><br>\n",
"<b>conditional_probabilities_log:</b> For each Top-500 Twitter account (i.e. for each feature), these are the probabilities that a user from class 0 or class 1 follows that account. We initialize all the conditional probabilities for class 0 as log(0.5) and all those for class 1 as a random number very close to log(0.5). Because there are 500 features and 2 classes, this matrix is of the size 500x2.\n",
"<br><br>\n",
"<b>probability_user_in_class_log:</b> For each user this is the probability that, given the accounts that this user follows, the user belongs to class 0 or 1. This will be calculated after the model is initialized, thus we leave the matrix empty for now. Since there are 1780 users and 2 classes, the size of the matrix is 1780x2.\n",
"<br><br>\n",
"<b>probability_data_given_parameters_log:</b> How well the model agrees with the actual data. Also known as loglikelihood. This is always negative and gets closer to zero the better the model becomes.\n",
"<br><br>\n",
"<b>Why logarithms?</b> I use logarithms because in raw space the numbers can get too small, which might cause problems."
]
},
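{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the underflow problem (a minimal sketch, not part of the model): multiplying 1100 probabilities of 0.5 underflows to exactly 0.0 in raw space, while the corresponding sum of logs is an ordinary number."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Minimal illustration of why we work in log space (not part of the model):\n",
"# a product of many small probabilities underflows to exactly 0.0,\n",
"# while the equivalent sum of logs stays an ordinary number\n",
"import numpy as np\n",
"\n",
"print 'Raw product:   ' + str(0.5 ** 1100)         # underflows to 0.0\n",
"print 'Log-space sum: ' + str(1100 * np.log(0.5))  # about -762.46"
],
"language": "python",
"metadata": {},
"outputs": []
},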
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>The idea in a nutshell:</h1>\n",
"<br>\n",
"After initializing the variables mentioned above, the clustering works in 4 steps that are then repeated until reasonable clusters are found. Notice that Step 1 and 2 correspond to the 'performing an expectation' (E) step and the 'maximization' (M) step of the Expectation Maximization algorithm respectively. Similarly, Step 4 can also be seen as a maximization (M) step.\n",
"<br><br>\n",
"<b>Step 1:</b> Given the actual data of whom each user follows and the randomly initialized conditional probabilities, for each user, we calculate the probability that that user is in class 0 or 1 (probability_user_in_class_log)\n",
"<br><br>\n",
"<b>Step 2:</b> Given the probability_user_in_class_log that we just updated and the actual accounts that the users follow, we now update the conditional probabilities (probabilities that an account is followed by class 0 or 1).\n",
"<br><br>\n",
"<b>Step 3:</b> Now we want some measure of how good our model is. Thus we calculate, given the parameters we just estimated in Step 1 and 2 , how well our model predicts the actual data, i.e. the loglikelihood.\n",
"<br><br>\n",
"<b>Step 4:</b> Here we update the class_probabilities_log, i.e. how likely it is for a random user to be in class 0 or 1.\n",
"<br><br>\n",
"Then we repeat step 1-4. With each repetition the model should get better at predicting the actual data, thus the loglikelihood should get closer to 0.\n",
"<br>\n",
"<br>\n",
"Now that the concept has been explained, let\u2019s turn it into actual code that you can run. This can be done in almost any language, but using one that does well with matrix transformation (e.g. Matlab, R, Python+Numpy) makes it a lot less painful. I used Python and Numpy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>The Code</h1>\n",
"<br>\n",
"\n",
"Let\u2019s first import the necessary modules and define a function that we\u2019ll use throughout.\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Import modules\n",
"import numpy as np\n",
"from numpy import log\n",
"import pdb\n",
"import cPickle as pickle\n",
"\n",
"# logsumexp(array) \n",
"def logsumexp(a):\n",
" a = np.asarray(a)\n",
" a_max = a.max()\n",
" return a_max + np.log((np.exp(a-a_max)).sum())"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
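{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, optional sanity check of <b>logsumexp</b>: where the naive computation log(sum(exp(a))) is numerically safe, the two should agree, and logsumexp keeps working where the naive version would underflow to log(0):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Optional sanity check for logsumexp (not part of the model)\n",
"a_safe = [log(0.2), log(0.3)]\n",
"print logsumexp(a_safe)                 # ~ log(0.5)\n",
"print np.log(np.exp(a_safe).sum())      # same value, computed naively\n",
"\n",
"a_tiny = [-1000.0, -1000.0]             # naive np.exp() underflows to 0 here\n",
"print logsumexp(a_tiny)                 # -1000 + log(2), about -999.307"
],
"language": "python",
"metadata": {},
"outputs": []
},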
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, unless you have your own data in the format mentioned at the beginning, let\u2019s create some toy data. Using toy data also makes it easier to find errors in your model / math, because you know what to expect."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Users 0:999 follow mostly accounts 0:249\n",
"# Users 1000:1779 follow mostly accounts 250:499\n",
"toy_data = np.empty([1780, 500], dtype=int)\n",
"for user in range(1000):\n",
" n, p = 1, .7\n",
" nr_r = np.random.binomial(n, p, 250)\n",
" toy_data[user, :250] = nr_r\n",
"\n",
" n, p = 1, .3\n",
" nr_r = np.random.binomial(n, p, 250)\n",
" toy_data[user, 250:500] = nr_r\n",
"\n",
"for user in range(1000, 1780):\n",
" n, p = 1, .3\n",
" nr_r = np.random.binomial(n, p, 250)\n",
" toy_data[user, :250] = nr_r\n",
"\n",
" n, p = 1, .7\n",
" nr_r = np.random.binomial(n, p, 250)\n",
" toy_data[user, 250:500] = nr_r\n",
"\n",
"data = toy_data"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we create toy <b>data</b> of 1780 users and 500 accounts that they follow. Users 0 to 999 follow mostly accounts 0 to 249, while users 1000 to 779 follow mostly accounts 250 to 499. So we ultimately expect the model to put users 0 to 999 in one class and 1000 to 1779 into the other class."
]
},
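{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick, optional check that the toy <b>data</b> looks as intended: the mean of each block should be close to the p values used to generate it (0.7 and 0.3):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Optional check: block-wise follow rates should be near 0.7 / 0.3\n",
"print 'users 0:999,     accounts 0:249:   ' + str(data[:1000, :250].mean())\n",
"print 'users 0:999,     accounts 250:499: ' + str(data[:1000, 250:].mean())\n",
"print 'users 1000:1779, accounts 0:249:   ' + str(data[1000:, :250].mean())\n",
"print 'users 1000:1779, accounts 250:499: ' + str(data[1000:, 250:].mean())"
],
"language": "python",
"metadata": {},
"outputs": []
},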
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next let\u2019s initialize the model and define some variables, such as number of users that we will use over and over again. Finally we also initialize an empty matrix for the <b>probability_user_in_class_log</b>:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Initialize last_probability_data_given_parameters_log as negative infinity\n",
"last_probability_data_given_parameters_log = -np.inf\n",
"\n",
"# probabilities that a randomly drawn user is in class 0 or 1 [p_c1, p_c2]\n",
"class_probabilities_log = log([0.5, 0.5])\n",
"\n",
"# conditional probabilities for each feature by class\n",
"conditional_probabilities_log = np.empty([500, 2], dtype=float)\n",
"\n",
"# for one class we initialize them all as log(0.5)\n",
"# for the other class we initialize them as a random close to log(0.5)\n",
"for row in xrange(len(conditional_probabilities_log)):\n",
" conditional_probabilities_log[row, 0] = log(0.5)\n",
" conditional_probabilities_log[row, 1] = log(np.random.normal(loc=0.5, scale=0.0001))\n",
"\n",
"# Define for easier access\n",
"nr_of_users = data.shape[0]\n",
"nr_of_features = data.shape[1]\n",
"\n",
"# user_class_probabilities_log are the probabilities that\n",
"# each particular user is in class 0 or 1\n",
"# probability_user_in_class_log rows: users, cols: log([p(class0), p(class1)])\n",
"probability_user_in_class_log = np.empty([nr_of_users, 2], dtype=float)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to run Step 1, 2, 3, and 4:\n",
"<br>\n",
"<br>\n",
"In <b>Step 1</b>, we want to calculate the probability of each user being in class 0 and 1 (<b>probabilities_user_in_class[0:1]</b>).<br><br>\n",
"To do so, we iterate over each user in the actual data, and over each of the 500 accounts (features).\n",
"If that user actually follows an account, we add the conditional probability that that account is followed in class 0 and 1 (<b>conditional_pronbabilities_log[account][0 / 1]</b>) to the variables <b>probability_feature_terms_given_class[0]</b> and <b>probability_feature_terms_given_class[1]</b>.<br><br>\n",
"Then we add the <b>class_probabilities_log[0]</b> and <b>class_probabilities_log[1]</b>, which gives us the probability of that user being in class 0 and class 1. Normally we would multiply by <b>class_probabilities_log</b>, but remember that we are in log space.<br><br>\n",
"Finally we normalize this so that the probability of a user being in class 0 and class 1 add up to 1, which means that each user must be assigned to a class and add these normalized values to the probability_user_in_class_log matrix."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for user in xrange(nr_of_users):\n",
" probability_feature_terms_given_class = np.zeros([2, 1], dtype=float)\n",
" for feature in xrange(nr_of_features):\n",
" if data[user, feature] == 1:\n",
" probability_feature_terms_given_class[0] += conditional_probabilities_log[feature][0]\n",
" probability_feature_terms_given_class[1] += conditional_probabilities_log[feature][1]\n",
"\n",
"\n",
" probability_user_in_class_log[user][0] = (probability_feature_terms_given_class[0] +\n",
" class_probabilities_log[0])\n",
" probability_user_in_class_log[user][1] = (probability_feature_terms_given_class[1] +\n",
" class_probabilities_log[1])\n",
"\n",
" # Normalize probability_user_in_class\n",
" log_p0, log_p1 = probability_user_in_class_log[user]\n",
" z = np.logaddexp(log_p0, log_p1)\n",
" log_p0 = log_p0 - z\n",
" log_p1 = log_p1 - z\n",
"\n",
" probability_user_in_class_log[user] = [log_p0, log_p1]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
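{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, the loops in Step 1 can be replaced by array operations. The following is just a sketch of an equivalent vectorized version (assuming <b>data</b> contains only 0s and 1s); it is not needed for the rest of the tutorial:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Vectorized sketch of Step 1 (equivalent up to floating-point rounding):\n",
"# summing conditional_probabilities_log over the accounts a user follows\n",
"# is a matrix product with the binary data matrix\n",
"unnormalized = np.dot(data, conditional_probabilities_log) + class_probabilities_log\n",
"\n",
"# normalize each row so the two class probabilities sum to 1 (in log space)\n",
"z = np.logaddexp(unnormalized[:, 0], unnormalized[:, 1])\n",
"probability_user_in_class_log = unnormalized - z[:, None]"
],
"language": "python",
"metadata": {},
"outputs": []
},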
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can proceed with Step 2. Here we use the <b>probability_user_in_class_log</b> that we just calculated and the actual data to update the <b>conditional_probabilities_log</b>.<br><br>\n",
"To do this we first initialize a variable called <b>feature_count</b>. This variable stores for each account, how well that account is represented in the data, given <b>probability_user_in_class_log</b>.<br><br>\n",
"So if a user X follows an account Y, we add to <b>feature_counts[Y][0]</b> the probability of that user being in class 0 (<b>probability_user_in_class_log[user][0]</b>) and to <b>feature_counts[Y][1]</b> the probability of that user being in class 1 (<b>probability_user_in_class_log[user][1]</b>).<br><br>\n",
"These feature_counts later need to be normalized by dividing them by a variable <b>z</b>, which is the overall probability of the account Y being followed regardless of class.<br><br>\n",
"This is done by iterating over all Top-500 accounts, and then for each account iterating over all users in the data. If a user then actually follows an account, we add the <b>probability_user_in_class_log[user][0:1]</b> to <b>feature_counts[feature][0:1]</b>.<br><br>\n",
"Regardless of whether the user follows the account, we add the <b>probability_user_in_class_log</b> to the denominator <b>z</b> that is later used for normalization.<br><br>\n",
"After we iterate over all the users for one feature, we normalize the <b>conditional_probabilities_log</b> and move on to the next account, where we again iterate over all the users."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Update the conditional probabilities log\n",
"# feature_counts: rows: features [count_class1, count_class2]\n",
"feature_counts = np.ones([nr_of_features, 2], dtype=float)\n",
"\n",
"for feature in xrange(nr_of_features):\n",
" z = np.zeros(2)\n",
" for user in xrange(nr_of_users):\n",
" if data[user][feature] == 1:\n",
" feature_counts[feature] += np.exp(probability_user_in_class_log[user])\n",
" z += np.exp(probability_user_in_class_log[user])\n",
" conditional_probabilities_log[feature] = np.log(feature_counts[feature]) - np.log(z)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
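{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 2 can be vectorized in the same spirit. Note that <b>z</b> in the loop above is actually identical for every feature (it sums the responsibilities over all users), so it only needs to be computed once. Again just a sketch, not required for the tutorial:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Vectorized sketch of Step 2:\n",
"# responsibilities are the raw-space class probabilities per user\n",
"responsibilities = np.exp(probability_user_in_class_log)    # 1780 x 2\n",
"\n",
"# data.T.dot(...) sums each user's responsibilities into the features\n",
"# that user follows; the 1.0 is the same add-one smoothing as np.ones above\n",
"feature_counts = 1.0 + np.dot(data.T, responsibilities)     # 500 x 2\n",
"z = responsibilities.sum(axis=0)                            # same for every feature\n",
"conditional_probabilities_log = np.log(feature_counts) - np.log(z)"
],
"language": "python",
"metadata": {},
"outputs": []
},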
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can proceed to Step 3, calculating the probability of the entire data given the parameters we just estimated. To do this, we calculate the probability of each user given the data and the current parameters and then take the sum of all of these.<br><br>\n",
"\n",
"This means that we iterate over each user and for each user we iterate over all 500 accounts. If the user follows an account, we add to the probability of that user given the data (<b>probability_user_given_data_log[user]</b>) the conditional probability for that account (<b>conditional_probabilities_log[feature][0:1]</b>). We do this for all features and then move on to the next user.<br><br>\n",
"\n",
"<b>Note:</b> We are adding probabilities here. This is because the probabilities are in log space and addition in log space corresponds to multiplications in raw space.\n",
"<br><br>\n",
"\n",
"Finally we take the sum of <b>probability_user_given_data</b> and we have the probability of the entire data given the parameters we estimated in Step 1 and 2, i.e. the loglikelihood of the data."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Calculate probability of the entire data given curren parameters\n",
"probability_user_given_data_log = np.zeros(nr_of_users, dtype=float)# np.zeros([nr_of_users, 2], dtype=float)\n",
"for user in range(nr_of_users):\n",
" for feature in range(nr_of_features):\n",
" if data[user][feature] == 1:\n",
" p_user_given_0_log = (conditional_probabilities_log[feature][0]+ probability_user_in_class_log[user][0])\n",
" p_user_given_1_log = (conditional_probabilities_log[feature][1]+ probability_user_in_class_log[user][1])\n",
"\n",
" probability_user_given_data_log[user] += np.logaddexp(p_user_given_0_log, p_user_given_1_log)\n",
"\n",
"probability_data_given_parameters_log = np.sum(probability_user_given_data_log)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
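{
"cell_type": "markdown",
"metadata": {},
"source": [
"Step 3, too, reduces to array operations; the following is a sketch that mirrors the loop above and is not needed for the rest of the tutorial:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Vectorized sketch of Step 3: build the per-(user, feature) term\n",
"# logaddexp(cond_0 + p_user_0, cond_1 + p_user_1), then sum it over\n",
"# the accounts each user actually follows\n",
"terms = np.logaddexp(\n",
"    probability_user_in_class_log[:, 0][:, None] + conditional_probabilities_log[:, 0],\n",
"    probability_user_in_class_log[:, 1][:, None] + conditional_probabilities_log[:, 1])\n",
"probability_user_given_data_log = (data * terms).sum(axis=1)\n",
"probability_data_given_parameters_log = probability_user_given_data_log.sum()"
],
"language": "python",
"metadata": {},
"outputs": []
},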
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally we need to update the class_probabilities_log (probability that a randomly drawn user is from class 0 or 1).<br><br>\n",
"We do this by taking the log of the summed exponentials of <b>probability_user_in_class_log</b> for class 0 and 1 respectively and then subtracting the log(number of users) from each for normalization.\n",
"<br><br>\n",
"<b>Note:</b> Again we are in log spence. Hence we subtract instead of dividing."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Update class probailities log\n",
"p0_total_log = logsumexp(probability_user_in_class_log[:, 0])\n",
"p1_total_log = logsumexp(probability_user_in_class_log[:, 1])\n",
"class_probabilities_log = np.array([p0_total_log - np.log(nr_of_users), p1_total_log - np.log(nr_of_users)])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let\u2019s print out some stats, so we can see how the model is doing.\n",
"<br><br>\n",
"The <b>loglikelihood</b> is an indicator of how well the current estimates of the model describe the actual data. \n",
"<br><br>\n",
"And class_proababilities_raw give you an idea of how the users are currently being distributed into the two classes."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Log Likelihood: ' + str(probability_data_given_parameters_log)\n",
"\n",
"# Raw class probabilities\n",
"class_probabilities_raw = np.zeros(2)\n",
"class_probabilities_raw[0] = np.exp(class_probabilities_log[0])\n",
"class_probabilities_raw[1] = np.exp(class_probabilities_log[1])\n",
"\n",
"print 'Raw Class Probabilities: ' + str(class_probabilities_raw)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Log Likelihood: -306804.478584\n",
"Raw Class Probabilities: [ 0.50015865 0.49984135]\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now the model has gone through 1 iteration. We now want to repeat Steps 1 to 4 until the loglikelihood does not improve anymore. The loglikelihood should become closer to 0 each time we repeat Steps 1 to 4."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>Loop with X Iterations</h1>\n",
"<br>\n",
"Copy and pasting the code each time we want to run the model for another iteration is not very practical. So we can just wrap it in a loop."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Import modules\n",
"import numpy as np\n",
"from numpy import log\n",
"import pdb\n",
"import cPickle as pickle\n",
"\n",
"\n",
"def logsumexp(a):\n",
" a = np.asarray(a)\n",
" a_max = a.max()\n",
" return a_max + np.log((np.exp(a-a_max)).sum())\n",
"\n",
"\n",
"# Create toy data\n",
"\n",
"toy_data = np.empty([1780, 500], dtype=int)\n",
"for user in range(1000):\n",
" n, p = 1, .7\n",
" nr_r = np.random.binomial(n, p, 250)\n",
" toy_data[user, :250] = nr_r\n",
"\n",
" n, p = 1, .3\n",
" nr_r = np.random.binomial(n, p, 250)\n",
" toy_data[user, 250:500] = nr_r\n",
"\n",
"for user in range(1000, 1780):\n",
" n, p = 1, .3\n",
" nr_r = np.random.binomial(n, p, 250)\n",
" toy_data[user, :250] = nr_r\n",
"\n",
" n, p = 1, .7\n",
" nr_r = np.random.binomial(n, p, 250)\n",
" toy_data[user, 250:500] = nr_r\n",
"\n",
"data = toy_data\n",
"\n",
"\n",
"# == Initialize the model\n",
"# Set iterations of the model\n",
"iterations = 5\n",
"\n",
"# probabilities that a randomly drawn user is in class 0 or 1 [p_c1, p_c2]\n",
"class_probabilities_log = log([0.5, 0.5])\n",
"\n",
"# conditional probabilities for each feature by class\n",
"conditional_probabilities_log = np.empty([500, 2], dtype=float)\n",
"\n",
"# for one class we initialize them all as log(0.5)\n",
"# for the other class we initialize them as a random close to log(0.5)\n",
"for row in xrange(len(conditional_probabilities_log)):\n",
" conditional_probabilities_log[row, 0] = log(0.5)\n",
" conditional_probabilities_log[row, 1] = log(np.random.normal(loc=0.5, scale=0.0001))\n",
"\n",
"# Define for easier access\n",
"nr_of_users = data.shape[0]\n",
"nr_of_features = data.shape[1]\n",
"\n",
"# user_class_probabilities_log are the probabilities that\n",
"# each particular user is in class 0 or 1\n",
"# probability_user_in_class_log rows: users, cols: log([p(class0), p(class1)])\n",
"probability_user_in_class_log = np.empty([nr_of_users, 2], dtype=float)\n",
"\n",
"# == Run the model for x iterations\n",
"for it in xrange(iterations):\n",
"\n",
" # probability_feature_terms_given_class [p1, p2]\n",
"\n",
" for user in xrange(nr_of_users):\n",
" probability_feature_terms_given_class = np.zeros([2, 1], dtype=float)\n",
" for feature in xrange(nr_of_features):\n",
" if data[user, feature] == 1:\n",
" probability_feature_terms_given_class[0] += conditional_probabilities_log[feature][0]\n",
" probability_feature_terms_given_class[1] += conditional_probabilities_log[feature][1]\n",
"\n",
"\n",
" probability_user_in_class_log[user][0] = (probability_feature_terms_given_class[0] +\n",
" class_probabilities_log[0])\n",
" probability_user_in_class_log[user][1] = (probability_feature_terms_given_class[1] +\n",
" class_probabilities_log[1])\n",
"\n",
" # Normalize probability_user_in_class\n",
" log_p0, log_p1 = probability_user_in_class_log[user]\n",
" z = np.logaddexp(log_p0, log_p1)\n",
" log_p0 = log_p0 - z\n",
" log_p1 = log_p1 - z\n",
"\n",
" probability_user_in_class_log[user] = [log_p0, log_p1]\n",
" \n",
"\n",
" # Update the conditional probabilities log\n",
" # feature_counts: rows: features, cols: [count_class0, count_class1]\n",
" feature_counts = np.ones([nr_of_features, 2], dtype=float)\n",
"\n",
" for feature in xrange(nr_of_features):\n",
" z = np.zeros(2)\n",
" for user in xrange(nr_of_users):\n",
" if data[user][feature] == 1:\n",
" feature_counts[feature] += np.exp(probability_user_in_class_log[user])\n",
" z += np.exp(probability_user_in_class_log[user])\n",
" conditional_probabilities_log[feature] = np.log(feature_counts[feature]) - np.log(z)\n",
"\n",
"\n",
" # (with the real Twitter data, feature_counts comes out roughly sorted:\n",
" # feature 0 is the most followed account and feature 499 the least\n",
" # followed; with the toy data there is no such ordering)\n",
"\n",
" # Calculate probability of the entire data given current parameters\n",
" probability_user_given_data_log = np.zeros(nr_of_users, dtype=float)\n",
" for user in range(nr_of_users):\n",
" for feature in range(nr_of_features):\n",
" if data[user][feature] == 1:\n",
" p_user_given_0_log = (conditional_probabilities_log[feature][0] + probability_user_in_class_log[user][0])\n",
" p_user_given_1_log = (conditional_probabilities_log[feature][1] + probability_user_in_class_log[user][1])\n",
"\n",
" probability_user_given_data_log[user] += np.logaddexp(p_user_given_0_log, p_user_given_1_log)\n",
"\n",
" probability_data_given_parameters_log = np.sum(probability_user_given_data_log)\n",
"\n",
" # Update class probabilities log\n",
" p0_total_log = logsumexp(probability_user_in_class_log[:, 0])\n",
" p1_total_log = logsumexp(probability_user_in_class_log[:, 1])\n",
" class_probabilities_log = np.array([p0_total_log - np.log(nr_of_users), p1_total_log - np.log(nr_of_users)])\n",
" \n",
" # Raw class probabilities\n",
" class_probabilities_raw = np.zeros(2)\n",
" class_probabilities_raw[0] = np.exp(class_probabilities_log[0])\n",
" class_probabilities_raw[1] = np.exp(class_probabilities_log[1])\n",
" # Now give stats on the model's state\n",
" print 'Iteration: ' + str(it)\n",
" print 'Raw Class Probabilities: ' + str(class_probabilities_raw)\n",
" print 'Log Likelihood: ' + str(probability_data_given_parameters_log)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Iteration: 0\n",
"Raw Class Probabilities: [ 0.50066205 0.49933795]\n",
"Log Likelihood: -306824.911117\n",
"Iteration: 1"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Raw Class Probabilities: [ 0.50062014 0.49937986]\n",
"Log Likelihood: -306816.029179\n",
"Iteration: 2"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Raw Class Probabilities: [ 0.4980543 0.5019457]\n",
"Log Likelihood: -295566.826547\n",
"Iteration: 3"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Raw Class Probabilities: [ 0.43820225 0.56179775]\n",
"Log Likelihood: -270671.034527\n",
"Iteration: 4"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Raw Class Probabilities: [ 0.43820225 0.56179775]\n",
"Log Likelihood: -270671.034528\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model now runs for 5 iterations (or whatever you set the variable iterations to).\n",
"<br><br>\n",
"You might notice that after a few iterations the <b>loglikelihood</b> does not improve anymore. This can mean that the model has converged, i.e. is done and has found a way to cluster the users into the classes. You can hit CTRL+C to interrupt the model and look at <b>probability_user_in_class_log</b> to check which class it puts each user into."
]
},
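{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, here is a short sketch of how you could read off the clusters from <b>probability_user_in_class_log</b> and compare them against the known structure of the toy data (the variable names are just illustrative):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Hard cluster assignment per user: the class with the higher probability\n",
"assignments = probability_user_in_class_log.argmax(axis=1)\n",
"\n",
"# With the toy data, users 0:999 should land in one class and\n",
"# users 1000:1779 in the other (which class is which is arbitrary)\n",
"print 'Most common class, users 0:999:     ' + str(np.bincount(assignments[:1000]).argmax())\n",
"print 'Most common class, users 1000:1779: ' + str(np.bincount(assignments[1000:]).argmax())"
],
"language": "python",
"metadata": {},
"outputs": []
},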
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>Running multiple models</h1>\n",
"<br>\n",
"Because part of the initialization of the models is random, running the model several times may lead to different results each time.\n",
"<br><br>\n",
"One common problem is that the model gets stuck in a local minimum, i.e. the <b>loglikelihood</b> will not improve anymore, but the model is also nowhere near finding ideal clusters.\n",
"<br><br>\n",
"To make sure that the clusters we find are close to ideal, we can run multiple models instead of just one and in the end pick the model with the best <b>loglikelihood</b>.\n",
"<br><br>\n",
"The following code shows how that is done. I've only added another loop that runs as many models as specified and I now log the outputs of the models into files."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Import modules\n",
"import numpy as np\n",
"from numpy import log\n",
"import pdb\n",
"import cPickle as pickle\n",
"\n",
"def logsumexp(a):\n",
" a = np.asarray(a)\n",
" a_max = a.max()\n",
" return a_max + np.log((np.exp(a-a_max)).sum())\n",
"\n",
"\n",
"# Create toy data, data = toy_data to use toy data instead of real data\n",
"# I though maybe the real data was too noisy,\n",
"# but we face the same problems when using toy data\n",
"\n",
"# Users 0:999 follow mostly accounts 0:249\n",
"# Users 1000:1779 follow mostly accounts 250:499\n",
"toy_data = np.empty([1780, 500], dtype=int)\n",
"for user in range(1000):\n",
" n, p = 1, .7\n",
" nr_r = np.random.binomial(n, p, 250)\n",
" toy_data[user, :250] = nr_r\n",
"\n",
" n, p = 1, .3\n",
" nr_r = np.random.binomial(n, p, 250)\n",
" toy_data[user, 250:500] = nr_r\n",
"\n",
"for user in range(1000, 1780):\n",
" n, p = 1, .3\n",
" nr_r = np.random.binomial(n, p, 250)\n",
" toy_data[user, :250] = nr_r\n",
"\n",
" n, p = 1, .7\n",
" nr_r = np.random.binomial(n, p, 250)\n",
" toy_data[user, 250:500] = nr_r\n",
"\n",
"data = toy_data\n",
"\n",
"# --- Initialize the model ---\n",
"\n",
"# set parameter for models, summary and iterations and SD\n",
"\n",
"# Set number of different models\n",
"models = 3\n",
"iterations = 5\n",
"\n",
"# summary data file; contains stats for each model and each iteration\n",
"summary = np.zeros([models * iterations, 4])\n",
"all_conditional_probabilities = []\n",
"all_probabilities_user_in_class = []\n",
"all_class_probabilities = []\n",
"\n",
"counts = 0\n",
"\n",
"for model in range(models):\n",
" # Initialize last_probability_data_given_parameters_log as negative infinity\n",
" last_probability_data_given_parameters_log = -np.inf\n",
"\n",
" # probabilities that a randomly drawn user is in class 0 or 1 [p_c1, p_c2]\n",
" class_probabilities_log = log([0.5, 0.5])\n",
"\n",
" # conditional probabilities for each feature by class\n",
" conditional_probabilities_log = np.empty([500, 2], dtype=float)\n",
"\n",
" # for one class we initialize them all as log(0.5)\n",
" # for the other class we initialize them as a random number close to log(0.5)\n",
" for row in xrange(len(conditional_probabilities_log)):\n",
" conditional_probabilities_log[row, 0] = log(0.5)\n",
" conditional_probabilities_log[row, 1] = log(np.random.normal(loc=0.5, scale=0.0001))\n",
"\n",
" # Define for easier access\n",
" nr_of_users = data.shape[0]\n",
" nr_of_features = data.shape[1]\n",
"\n",
" # probability_user_in_class_log are the probabilities that\n",
" # each particular user is in class 0 or 1\n",
" # probability_user_in_class_log rows: users, cols: log([p(class0), p(class1)])\n",
" probability_user_in_class_log = np.empty([nr_of_users, 2], dtype=float)\n",
"\n",
" # --- Run ---\n",
" # == Outer Loop ==\n",
" for it in xrange(iterations):\n",
"\n",
" # == Inner Loop ==\n",
" # probability_feature_terms_given_class [p1, p2]\n",
"\n",
" for user in xrange(nr_of_users):\n",
" probability_feature_terms_given_class = np.zeros([2, 1], dtype=float)\n",
" for feature in xrange(nr_of_features):\n",
" if data[user, feature] == 1:\n",
" probability_feature_terms_given_class[0] += conditional_probabilities_log[feature][0]\n",
" probability_feature_terms_given_class[1] += conditional_probabilities_log[feature][1]\n",
"\n",
"\n",
" probability_user_in_class_log[user][0] = (probability_feature_terms_given_class[0] +\n",
" class_probabilities_log[0])\n",
" probability_user_in_class_log[user][1] = (probability_feature_terms_given_class[1] +\n",
" class_probabilities_log[1])\n",
"\n",
" # Normalize probability_user_in_class\n",
" log_p0, log_p1 = probability_user_in_class_log[user]\n",
" z = np.logaddexp(log_p0, log_p1)\n",
" log_p0 = log_p0 - z\n",
" log_p1 = log_p1 - z\n",
"\n",
" probability_user_in_class_log[user] = [log_p0, log_p1]\n",
"\n",
" # Update the conditional probabilities log\n",
" # feature_counts: rows: features, cols: [count_class0, count_class1]\n",
" feature_counts = np.ones([nr_of_features, 2], dtype=float)\n",
"\n",
" for feature in xrange(nr_of_features):\n",
" z = np.zeros(2)\n",
" for user in xrange(nr_of_users):\n",
" if data[user][feature] == 1:\n",
" feature_counts[feature] += np.exp(probability_user_in_class_log[user])\n",
" z += np.exp(probability_user_in_class_log[user])\n",
" conditional_probabilities_log[feature] = np.log(feature_counts[feature]) - np.log(z)\n",
"\n",
"\n",
" # Calculate probability of the entire data given current parameters\n",
" probability_user_given_data_log = np.zeros(nr_of_users, dtype=float)\n",
" for user in range(nr_of_users):\n",
" for feature in range(nr_of_features):\n",
" if data[user][feature] == 1:\n",
" p_user_given_0_log = (conditional_probabilities_log[feature][0]+ probability_user_in_class_log[user][0])\n",
" p_user_given_1_log = (conditional_probabilities_log[feature][1]+ probability_user_in_class_log[user][1])\n",
"\n",
" probability_user_given_data_log[user] += np.logaddexp(p_user_given_0_log, p_user_given_1_log)\n",
"\n",
" probability_data_given_parameters_log = np.sum(probability_user_given_data_log)\n",
"\n",
" # Update class probabilities log\n",
" p0_total_log = logsumexp(probability_user_in_class_log[:, 0])\n",
" p1_total_log = logsumexp(probability_user_in_class_log[:, 1])\n",
" class_probabilities_log = np.array([p0_total_log - np.log(nr_of_users), p1_total_log - np.log(nr_of_users)])\n",
"\n",
" # Raw class probabilities\n",
" class_probabilities_raw = np.zeros(2)\n",
" class_probabilities_raw[0] = np.exp(class_probabilities_log[0])\n",
" class_probabilities_raw[1] = np.exp(class_probabilities_log[1])\n",
" # Now give stats on the model's state\n",
" print 'Model: ' + str(model)\n",
" print 'Iteration: ' + str(it)\n",
" print 'Raw Class Probabilities: ' + str(class_probabilities_raw)\n",
" print 'Log Likelihood: ' + str(probability_data_given_parameters_log)\n",
"\n",
" # Summary row: [counts, model, it, probability_data_given_parameters_log]\n",
" summary[counts] = [counts, model, it, probability_data_given_parameters_log]\n",
"\n",
" # .copy() so each entry stores a snapshot; these two arrays are\n",
" # updated in place every iteration, and without .copy() all entries\n",
" # would end up pointing at the same final array\n",
" all_conditional_probabilities.append([counts, model, it, conditional_probabilities_log.copy()])\n",
" all_probabilities_user_in_class.append([counts, model, it, probability_user_in_class_log.copy()])\n",
" all_class_probabilities.append([counts, model, it, class_probabilities_raw])\n",
"\n",
" counts += 1\n",
"\n",
" if last_probability_data_given_parameters_log >= probability_data_given_parameters_log:\n",
" print('Local Minimum; Break.')\n",
" break\n",
"\n",
" last_probability_data_given_parameters_log = probability_data_given_parameters_log\n",
"\n",
"# Write results to files\n",
"\n",
"# The comments explain the structure of each of these output data structures\n",
"# They are save in the respective .pickle files\n",
"\n",
"\n",
"# summary[counts] = [counts, model, it, probability_data_given_parameters_log]\n",
"g = open('summary.pickle', 'a')\n",
"pickle.dump(summary, g)\n",
"g.close()\n",
"\n",
"# all_conditional_probabilities.append([counts, model, it, conditional_probabilities_log])\n",
"h = open('all_conditional_probabilities.pickle', 'a')\n",
"pickle.dump(all_conditional_probabilities, h)\n",
"h.close()\n",
"\n",
"# all_probabilities_user_in_class.append([counts, model, it, probability_user_in_class_log])\n",
"i = open('all_probabilities_user_in_class.pickle', 'a')\n",
"pickle.dump(all_probabilities_user_in_class, i)\n",
"i.close()\n",
"\n",
"# Summary: [model, iteration, loglikelihood]\n",
"l = open('all_class_probabilities.pickle', 'a')\n",
"pickle.dump(all_class_probabilities, l)\n",
"l.close()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Model: 0\n",
"Iteration: 0\n",
"Raw Class Probabilities: [ 0.50062392 0.49937608]\n",
"Log Likelihood: -306835.163272\n",
"Model: 0"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 1\n",
"Raw Class Probabilities: [ 0.50063561 0.49936439]\n",
"Log Likelihood: -306834.818771\n",
"Model: 0"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 2\n",
"Raw Class Probabilities: [ 0.50020554 0.49979446]\n",
"Log Likelihood: -306287.610221\n",
"Model: 0"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 3\n",
"Raw Class Probabilities: [ 0.43952649 0.56047351]\n",
"Log Likelihood: -270560.38289\n",
"Model: 0"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 4\n",
"Raw Class Probabilities: [ 0.43820225 0.56179775]\n",
"Log Likelihood: -270581.07347\n",
"Local Minimum; Break.\n",
"Model: 1"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 0\n",
"Raw Class Probabilities: [ 0.50036027 0.49963973]\n",
"Log Likelihood: -306835.090345\n",
"Model: 1"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 1\n",
"Raw Class Probabilities: [ 0.50021452 0.49978548]\n",
"Log Likelihood: -306718.571023\n",
"Model: 1"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 2\n",
"Raw Class Probabilities: [ 0.45871057 0.54128943]\n",
"Log Likelihood: -271524.971179\n",
"Model: 1"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 3\n",
"Raw Class Probabilities: [ 0.43820225 0.56179775]\n",
"Log Likelihood: -270581.07347\n",
"Model: 1"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 4\n",
"Raw Class Probabilities: [ 0.43820225 0.56179775]\n",
"Log Likelihood: -270581.07347\n",
"Local Minimum; Break.\n",
"Model: 2"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 0\n",
"Raw Class Probabilities: [ 0.50064378 0.49935622]\n",
"Log Likelihood: -306835.147283\n",
"Model: 2"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 1\n",
"Raw Class Probabilities: [ 0.50053634 0.49946366]\n",
"Log Likelihood: -306809.372645\n",
"Model: 2"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 2\n",
"Raw Class Probabilities: [ 0.51215959 0.48784041]\n",
"Log Likelihood: -283786.109272\n",
"Model: 2"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 3\n",
"Raw Class Probabilities: [ 0.56179775 0.43820225]\n",
"Log Likelihood: -270581.07347\n",
"Model: 2"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Iteration: 4\n",
"Raw Class Probabilities: [ 0.56179775 0.43820225]\n",
"Log Likelihood: -270581.07347\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>Find the best model</h1>\n",
"<br>\n",
"Now you load summary.pickle into a variable. The matrix has the following columns:<br><br>\n",
"<b>[counts, model, it, probability_data_given_parameters_log]</b><br><br>\n",
"They are explained below:\n",
"<br>\n",
"<br>\n",
"<b>counts:</b> row number\n",
"<br>\n",
"<br>\n",
"<b>model:</b> number of the model\n",
"<br>\n",
"<br>\n",
"<b>it</b>: iteration in the current model\n",
"<br>\n",
"<br>\n",
"<b>probability_data_given_parameters_log</b>: loglikelihood of the current model at the current iteration\n",
"<br>\n",
"<br>\n",
"The matrix for the summary was created beforehand with a number of rows equal to the number of models times the number of iterations for each model. The code above does however skip to the next model, once the current model has converged and the <b>loglikelihood</b> does not improve anymore. Thus some rows of the summary matrix might be unused. You can find (and delete) these by looking for rows in which each column is 0.\n",
"<br><br>\n",
"Once you have found the best fitting model, you look up the value for <b>counts</b>. That value indicates \n",
"<br><br>\n",
"<b>a)</b> in which row of <b>all_conditional_probabilities</b> you find the conditional probabilities for that model, \n",
"<br><br>\n",
"<b>b)</b> in which row of <b>all_probabilities_user_in_class</b> you find the the probabilities that a given user is in class 0 or 1 for that model / in which cluster each user is.\n"
]
},
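{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, here is a sketch of that lookup. It assumes the .pickle files written by the code above and a single run, so each file contains exactly one pickled object; the variable names are just illustrative:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Sketch: load the summary, drop unused rows, pick the best model\n",
"import cPickle as pickle\n",
"import numpy as np\n",
"\n",
"f = open('summary.pickle', 'r')\n",
"summary = pickle.load(f)\n",
"f.close()\n",
"\n",
"# unused rows are all zeros (models that converged early)\n",
"summary = summary[summary.any(axis=1)]\n",
"\n",
"# column 3 is the loglikelihood; the largest (closest to 0) value wins\n",
"best = summary[summary[:, 3].argmax()]\n",
"best_counts = int(best[0])\n",
"print 'Best model: ' + str(int(best[1])) + ', iteration: ' + str(int(best[2]))\n",
"\n",
"# look up the matching entry [counts, model, it, probability_user_in_class_log]\n",
"f = open('all_probabilities_user_in_class.pickle', 'r')\n",
"all_probabilities_user_in_class = pickle.load(f)\n",
"f.close()\n",
"best_probabilities_user_in_class = all_probabilities_user_in_class[best_counts][3]"
],
"language": "python",
"metadata": {},
"outputs": []
},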
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>Thanks!</h1>\n",
"\n",
"That is it for now. In the next part of this tutorial, I will explain, how one can scale the model up to create more than 2 clusters.\n",
"<br>\n",
"<br>\n",
"If you have any questions or comments: <b>@joern82k</b> on Twitter."
]
}
],
"metadata": {}
}
]
}