Last active
September 26, 2019 21:27
-
-
Save alinaselega/594450cef2bffd7ade8a6967c492b526 to your computer and use it in GitHub Desktop.
This notebook numerically checks the form for the marginal multinomial likelihood
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "multinomial_pmf_numerics.ipynb", | |
"provenance": [], | |
"collapsed_sections": [] | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "9d8yzppwKINZ", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"This notebook numerically checks the value for a marginal multinomial PMF.\n", | |
"\n", | |
"If $(X_1, ..., X_m)$ are a random variables counting the number of successes of each of the $m$ categories with probabilities of success $(p_1, ..., p_m)$ out of a total of $n$ trials, then the marginal likelihood of the subset of the random variables $(X_1, ..., X_k)$, where $k < m$ is given by:\n", | |
"\n", | |
"\\begin{eqnarray}\n", | |
" p(X_1 = x_1, ..., X_k = x_k) = \\frac{n!}{x_1! ... x_k! (n-z_k)!} p_1^{x_1} ... p_k^{x_k} q_k^{n-z_k} \\\\\n", | |
" z_k = x_1 + ... + x_k \\\\ \n", | |
" q_k = 1 - p_1 - ... - p_k\n", | |
"\\end{eqnarray}\n", | |
"\n", | |
"The general form of this marginal likelihood is given by:\n", | |
"\n", | |
"\\begin{eqnarray}\n", | |
" p(X_1 = x_1, ..., X_k = x_k) = \\sum_{x_{k+1}} ... \\sum_{x_m} p(X_1 = x_1, ..., X_m = x_m) = \\sum_{x_{k+1}} ... \\sum_{x_m} \\frac{n!}{x_1! ... x_m!} p_1^{x_1} ... p_k^{x_m}\n", | |
"\\end{eqnarray}" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "wON6XvmXbxwn", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Generate some multinomial observations:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "6P0zgo6LJNLu", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 35 | |
}, | |
"outputId": "2c5246a9-bdef-4f06-b94e-5a9636578b70" | |
}, | |
"source": [ | |
"import scipy.stats as st\n", | |
"import numpy as np\n", | |
"\n", | |
"G = 7\n", | |
"N = 21\n", | |
"\n", | |
"e = np.random.normal(size=G)\n", | |
"p = np.exp(e) / sum(np.exp(e))\n", | |
"\n", | |
"rv = st.multinomial(N, p)\n", | |
"obs = rv.rvs()\n", | |
"print(obs)" | |
], | |
"execution_count": 19, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"[[5 1 7 5 1 2 0]]\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "zOsumzhWb1oP", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"This is the full log PMF:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "fJpYvjIuLntZ", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 35 | |
}, | |
"outputId": "c0e4c315-f7b5-41cf-a498-02e79b974b4f" | |
}, | |
"source": [ | |
"rv.logpmf(obs)" | |
], | |
"execution_count": 20, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"array([-8.09912776])" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 20 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "2MUH0qvjb41I", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Here is the subset we care about; isolate corresponding observations:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "JOt8CZWBMZV3", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 35 | |
}, | |
"outputId": "560fd719-9543-4655-80e3-7c1ba4f36797" | |
}, | |
"source": [ | |
"# Evaluate the marginal likelihood of the first 2 count variables\n", | |
"k = 2\n", | |
"subset = range(k)\n", | |
"\n", | |
"obs = obs[0][:]\n", | |
"print(obs)" | |
], | |
"execution_count": 21, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"[5 1 7 5 1 2 0]\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "8khJ29s0R4HA", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 35 | |
}, | |
"outputId": "8fbe4a4d-ea6a-4e7a-986e-fbb056af19db" | |
}, | |
"source": [ | |
"subset_sum = sum(obs[0:k])\n", | |
"subset_sum" | |
], | |
"execution_count": 22, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"6" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 22 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "1SMfoG1Mb94_", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Make an array with every permutation of remaining categories' counts; note that the $\\sum_{i=1}^{m} x_i = n$, so the possible values for the remaining categories counts that we must sum over are bounded by $n - z_k$." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "ivg-k4CVR6i4", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"from sympy.utilities.iterables import variations\n", | |
"\n", | |
"perms = list(variations(range(N - subset_sum + 1), G-k-1, True))\n", | |
"#print(perms)\n", | |
"\n", | |
"perms_array = np.zeros((len(perms), G - k))\n", | |
"\n", | |
"for i in range(len(perms)):\n", | |
" perms_array[i, 0:(G-k-1)] = perms[i]\n", | |
" sum_item = sum(perms[i])\n", | |
" perms_array[i, G-k-1] = max(N - subset_sum - sum_item, 0)\n", | |
" if (sum(perms_array[i,:]) > N - subset_sum):\n", | |
" perms_array[i,:] = np.nan\n", | |
" \n", | |
"mask = np.where(np.isnan(perms_array[:,0]))\n", | |
"perms_array = np.delete(perms_array, mask[0], 0)\n", | |
"#perms_array\n" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "ZT3dR0zicc9_", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Compute the exact log PMF." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "D6Do9wSyZYcn", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 35 | |
}, | |
"outputId": "6e4cd7c8-107b-49a4-8a2c-28af958098ae" | |
}, | |
"source": [ | |
"exact_pmf = 0\n", | |
"\n", | |
"values = np.zeros(G)\n", | |
"values[0:k] = obs[0:k]\n", | |
"\n", | |
"for item in perms_array:\n", | |
" values[k:G] = item\n", | |
" exact_pmf = exact_pmf + rv.pmf(values)\n", | |
" \n", | |
"np.log(exact_pmf)" | |
], | |
"execution_count": 24, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"-2.61695788104343" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 24 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "1ZrN-CMacgaA", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Now compute it with my implementation for evaluating the multinomial term and the marginal log-likelihood." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "ZBbjdwCOawC3", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"from scipy.special import binom\n", | |
"from scipy import special as special\n", | |
"\n", | |
"# Compute a multinomial coefficient as a product of binomials\n", | |
"def log_multinomial_coef(ks):\n", | |
" if len(ks) == 1:\n", | |
" return 0\n", | |
" return np.log(binom(sum(ks), ks[-1])) + log_multinomial_coef(ks[:-1])\n", | |
"\n", | |
"\n", | |
"# Compute a log marginal multinomial coefficient\n", | |
"def log_marginal_multinomial_coef(ys, n):\n", | |
" in_segment_counts = sum(ys)\n", | |
" component1 = log_multinomial_coef(ys)\n", | |
" \n", | |
" num = np.zeros(n - in_segment_counts)\n", | |
" denom = np.zeros(n - in_segment_counts)\n", | |
" for i in range(1, n - in_segment_counts + 1):\n", | |
" num[i-1] = in_segment_counts + i\n", | |
" denom[i-1] = i\n", | |
"\n", | |
" new_num = [x for x in num if x not in denom]\n", | |
" new_denom = [x for x in denom if x not in num]\n", | |
" \n", | |
" if((len(new_num) == 0) and (len(new_denom) == 0)):\n", | |
" component2 = 0\n", | |
" else:\n", | |
" component2 = sum(np.log(new_num)) - sum(np.log(new_denom))\n", | |
" \n", | |
" return component1 + component2" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "cD-n9IXDa7W3", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 71 | |
}, | |
"outputId": "08042636-a248-42c8-adb6-416dc10bead2" | |
}, | |
"source": [ | |
"y_marg = np.zeros(k + 1)\n", | |
"y_marg[0:k] = obs[0:k]\n", | |
"y_marg[-1] = N - subset_sum\n", | |
"\n", | |
"p_marg = np.zeros(k + 1)\n", | |
"p_marg[0:k] = p[0:k]\n", | |
"p_marg[-1] = 1 - sum(p[0:k])\n", | |
"\n", | |
"print(y_marg)\n", | |
"print(p_marg)\n", | |
"\n", | |
"# Log marginal loglik\n", | |
"loglik = log_marginal_multinomial_coef(obs[0:k], N) + sum(y_marg * np.log(p_marg))\n", | |
"loglik\n" | |
], | |
"execution_count": 26, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"[ 5. 1. 15.]\n", | |
"[0.23912625 0.03520192 0.72567183]\n" | |
], | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"-2.6169578810434277" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 26 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "0xB9gmric0cv", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"This is computing the multinomial factor directly as the values are small enough." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "E7k6USpecQ9o", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 35 | |
}, | |
"outputId": "034f629d-dee1-47a6-80e8-37bf7bb4bd92" | |
}, | |
"source": [ | |
"#Let's directly compute that factor\n", | |
"\n", | |
"from scipy.special import factorial\n", | |
"lik = factorial(N) / (np.prod(factorial(obs[0:k])) * factorial(N - subset_sum)) * np.prod(np.power(p_marg, y_marg))\n", | |
"np.log(lik)" | |
], | |
"execution_count": 27, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"-2.616957881043428" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 27 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "bdsW5ei4c9iP", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"All values match!" | |
] | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment