josef-pkt/gist:3cf9c6a2bf304904c839 Secret

## gistfile1.py
{
 "metadata": {
  "name": "",
  "signature": "sha256:b1c0ef2fa78fc58c40e0a0507d039c8e442f6f80e88171ff751164506af203b7"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Comparing GEE and cluster robust standard errors for GLM-Poisson and Poisson"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Sections**\n",
      "\n",
      " - [GEE with Independence correlation structure](#GEE-with-Independence-correlation-structure)\n",
      "\n",
      " - [GLM with cluster robust standard errors](#GLM-with-cluster-robust-standard-errors)\n",
      " \n",
      " - [Poisson with cluster robust standard errors](#Poisson-with-cluster-robust-standard-errors)\n",
      " \n",
      " - [GLM and Poisson with cluster robust standard errors and t distribution based inference](#GLM-and-Poisson-with-cluster-robust-standard-errors-and-t-distribution-based-inference)\n",
      " \n",
      " - [Comparison](#Comparison)\n",
      " \n",
      " - [Appendix: Convariance-Types-in-GEE](#Appendix:-Convariance-Types-in-GEE)\n",
      " \n",
      " \n",
      " "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this notebook we compare different models that have inference that is robust to clustered errors, as they usually occur in longitudinal or repeated measures data.\n",
      "\n",
      "In this model we can estimate the parameters of the mean model consistently (estimates converge to true parameters in large samples) even if the correlation structure of the error term or the distributions is incorrect. However, although the parameter estimates are consistent, the estimate of their covariance is not. If we use the usual standard errors for the parameter estimates that are based on  the assumption that we have independently and identically distributed (iid) observations, then those standard errors can be very misleading if our data is not close to satisfying the *iid* assumption.\n",
      "\n",
      "The commonly used solution is to correct the standard errors for un- or mis-specified heteroscedasticity or correlation in the distributions of the data. In the case of longitudinal data where we observe an individual or a group repeately, a frequent problem is that observations for the same individual or group are correlated with each other. Ignoring this correlation can produce misleading inference as we can see in the following example.\n",
      "\n",
      "The next parts use several models from statsmodels to estimate parameters and standard errors. The last section compares the results."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# -*- coding: utf-8 -*-\n",
      "\"\"\"\n",
      "Created on Sun Aug 17 21:29:39 2014\n",
      "\n",
      "Author: Josef Perktold, based on GEE example by Kerby Shedden\n",
      "\n",
      "\"\"\"\n",
      "\n",
      "import numpy as np\n",
      "import pandas as pd\n",
      "from statsmodels.genmod.generalized_linear_model import GLM\n",
      "from statsmodels.genmod.generalized_estimating_equations import GEE\n",
      "from statsmodels.genmod.dependence_structures import (Exchangeable,\n",
      "    Independence, Autoregressive)\n",
      "from statsmodels.genmod import families"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data_url = \"http://vincentarelbundock.github.io/Rdatasets/csv/MASS/epil.csv\"\n",
      "data = pd.read_csv(data_url)\n",
      "print data.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "   Unnamed: 0  y      trt  base  age  V4  subject  period     lbase      lage\n",
        "0           1  5  placebo    11   31   0        1       1 -0.756354  0.114204\n",
        "1           2  3  placebo    11   31   0        1       2 -0.756354  0.114204\n",
        "2           3  3  placebo    11   31   0        1       3 -0.756354  0.114204\n",
        "3           4  3  placebo    11   31   1        1       4 -0.756354  0.114204\n",
        "4           5  3  placebo    11   30   0        2       1 -0.756354  0.081414\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "GEE with Independence correlation structure"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "fam = families.Poisson()\n",
      "ind = Independence()\n",
      "mod1 = GEE.from_formula(\"y ~ age + trt + base\", \"subject\", data, cov_struct=ind, family=fam)\n",
      "rslt1 = mod1.fit()\n",
      "print '\\n'\n",
      "print rslt1.summary()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "\n",
        "                               GEE Regression Results                              \n",
        "===================================================================================\n",
        "Dep. Variable:                           y   No. Observations:                  236\n",
        "Model:                                 GEE   No. clusters:                       59\n",
        "Method:                        Generalized   Min. cluster size:                   4\n",
        "                      Estimating Equations   Max. cluster size:                   4\n",
        "Family:                            Poisson   Mean cluster size:                 4.0\n",
        "Dependence structure:         Independence   Num. iterations:                    51\n",
        "Date:                     Tue, 19 Aug 2014   Scale:                           5.087\n",
        "Covariance type:                    robust   Time:                         11:22:35\n",
        "====================================================================================\n",
        "                       coef    std err          z      P>|z|      [95.0% Conf. Int.]\n",
        "------------------------------------------------------------------------------------\n",
        "Intercept            0.5730      0.361      1.589      0.112        -0.134     1.280\n",
        "trt[T.progabide]    -0.1519      0.171     -0.888      0.375        -0.487     0.183\n",
        "age                  0.0223      0.011      1.960      0.050      2.11e-06     0.045\n",
        "base                 0.0226      0.001     18.451      0.000         0.020     0.025\n",
        "==============================================================================\n",
        "Skew:                          3.7823   Kurtosis:                      28.6672\n",
        "Centered skew:                 2.7597   Centered kurtosis:             21.9865\n",
        "==============================================================================\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1.params"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 4,
       "text": [
        "array([ 0.57304359, -0.15188049,  0.02234757,  0.02263524])"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "<statsmodels.genmod.generalized_estimating_equations.GEEResults at 0x6f54fb0>"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "GLM with cluster robust standard errors"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "mod2 = GLM.from_formula(\"y ~ age + trt + base\", data, family=fam)\n",
      "groups = data[\"subject\"]\n",
      "\n",
      "\n",
      "rslt2 = mod2.fit(cov_type='cluster', cov_kwds=dict(groups=groups))\n",
      "print '\\n'\n",
      "print(rslt2.summary())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "\n",
        "                 Generalized Linear Model Regression Results                  \n",
        "==============================================================================\n",
        "Dep. Variable:                      y   No. Observations:                  236\n",
        "Model:                            GLM   Df Residuals:                      232\n",
        "Model Family:                 Poisson   Df Model:                            3\n",
        "Link Function:                    log   Scale:                             1.0\n",
        "Method:                          IRLS   Log-Likelihood:                -861.39\n",
        "Date:                Tue, 19 Aug 2014   Deviance:                       906.98\n",
        "Time:                        11:22:35   Pearson chi2:                 1.18e+03\n",
        "No. Iterations:                     8                                         \n",
        "====================================================================================\n",
        "                       coef    std err          z      P>|z|      [95.0% Conf. Int.]\n",
        "------------------------------------------------------------------------------------\n",
        "Intercept            0.5730      0.366      1.565      0.118        -0.145     1.291\n",
        "trt[T.progabide]    -0.1519      0.174     -0.875      0.382        -0.492     0.188\n",
        "age                  0.0223      0.012      1.931      0.053        -0.000     0.045\n",
        "base                 0.0226      0.001     18.177      0.000         0.020     0.025\n",
        "====================================================================================\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt_glm = mod2.fit()\n",
      "print '\\n'\n",
      "print(rslt_glm.summary())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "\n",
        "                 Generalized Linear Model Regression Results                  \n",
        "==============================================================================\n",
        "Dep. Variable:                      y   No. Observations:                  236\n",
        "Model:                            GLM   Df Residuals:                      232\n",
        "Model Family:                 Poisson   Df Model:                            3\n",
        "Link Function:                    log   Scale:                             1.0\n",
        "Method:                          IRLS   Log-Likelihood:                -861.39\n",
        "Date:                Tue, 19 Aug 2014   Deviance:                       906.98\n",
        "Time:                        11:22:35   Pearson chi2:                 1.18e+03\n",
        "No. Iterations:                     8                                         \n",
        "====================================================================================\n",
        "                       coef    std err          z      P>|z|      [95.0% Conf. Int.]\n",
        "------------------------------------------------------------------------------------\n",
        "Intercept            0.5730      0.136      4.224      0.000         0.307     0.839\n",
        "trt[T.progabide]    -0.1519      0.048     -3.175      0.001        -0.246    -0.058\n",
        "age                  0.0223      0.004      5.549      0.000         0.014     0.030\n",
        "base                 0.0226      0.001     44.436      0.000         0.022     0.024\n",
        "====================================================================================\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Poisson with cluster robust standard errors"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the following we try the same analysis with Poisson from the `discrete_models`. \n",
      "\n",
      "Note: I switched to using `'bfgs'` as optimizers because the default `'newton'` method did not converge correctly."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import statsmodels.discrete.discrete_model as dm\n",
      "\n",
      "mod3 = dm.Poisson.from_formula(\"y ~ age + trt + base\", data)\n",
      "\n",
      "rslt3 = mod3.fit(method='bfgs', cov_type='cluster', cov_kwds=dict(groups=groups))\n",
      "print '\\n'\n",
      "print(rslt3.summary())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Optimization terminated successfully.\n",
        "         Current function value: 3.649945\n",
        "         Iterations: 16\n",
        "         Function evaluations: 25\n",
        "         Gradient evaluations: 25\n",
        "\n",
        "\n",
        "                          Poisson Regression Results                          \n",
        "==============================================================================\n",
        "Dep. Variable:                      y   No. Observations:                  236\n",
        "Model:                        Poisson   Df Residuals:                      232\n",
        "Method:                           MLE   Df Model:                            3\n",
        "Date:                Tue, 19 Aug 2014   Pseudo R-squ.:                  0.4754\n",
        "Time:                        11:22:35   Log-Likelihood:                -861.39\n",
        "converged:                       True   LL-Null:                       -1641.9\n",
        "                                        LLR p-value:                     0.000\n",
        "====================================================================================\n",
        "                       coef    std err          z      P>|z|      [95.0% Conf. Int.]\n",
        "------------------------------------------------------------------------------------\n",
        "Intercept            0.5730      0.366      1.565      0.118        -0.145     1.291\n",
        "trt[T.progabide]    -0.1519      0.174     -0.875      0.382        -0.492     0.188\n",
        "age                  0.0223      0.012      1.931      0.053        -0.000     0.045\n",
        "base                 0.0226      0.001     18.177      0.000         0.020     0.025\n",
        "====================================================================================\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "GLM and Poisson with cluster robust standard errors and t distribution based inference"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt2t = mod2.fit(cov_type='cluster', cov_kwds=dict(groups=groups), use_t=True)\n",
      "print(rslt2t.summary())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "                 Generalized Linear Model Regression Results                  \n",
        "==============================================================================\n",
        "Dep. Variable:                      y   No. Observations:                  236\n",
        "Model:                            GLM   Df Residuals:                      232\n",
        "Model Family:                 Poisson   Df Model:                            3\n",
        "Link Function:                    log   Scale:                             1.0\n",
        "Method:                          IRLS   Log-Likelihood:                -861.39\n",
        "Date:                Tue, 19 Aug 2014   Deviance:                       906.98\n",
        "Time:                        11:22:35   Pearson chi2:                 1.18e+03\n",
        "No. Iterations:                     8                                         \n",
        "====================================================================================\n",
        "                       coef    std err          t      P>|t|      [95.0% Conf. Int.]\n",
        "------------------------------------------------------------------------------------\n",
        "Intercept            0.5730      0.366      1.565      0.123        -0.160     1.306\n",
        "trt[T.progabide]    -0.1519      0.174     -0.875      0.385        -0.499     0.196\n",
        "age                  0.0223      0.012      1.931      0.058        -0.001     0.046\n",
        "base                 0.0226      0.001     18.177      0.000         0.020     0.025\n",
        "====================================================================================\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt3t = mod3.fit(method='bfgs', cov_type='cluster', cov_kwds=dict(groups=groups), use_t=True)\n",
      "print(rslt3t.summary())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Optimization terminated successfully.\n",
        "         Current function value: 3.649945\n",
        "         Iterations: 16\n",
        "         Function evaluations: 25\n",
        "         Gradient evaluations: 25\n",
        "                          Poisson Regression Results                          \n",
        "==============================================================================\n",
        "Dep. Variable:                      y   No. Observations:                  236\n",
        "Model:                        Poisson   Df Residuals:                      232\n",
        "Method:                           MLE   Df Model:                            3\n",
        "Date:                Tue, 19 Aug 2014   Pseudo R-squ.:                  0.4754\n",
        "Time:                        11:22:36   Log-Likelihood:                -861.39\n",
        "converged:                       True   LL-Null:                       -1641.9\n",
        "                                        LLR p-value:                     0.000\n",
        "====================================================================================\n",
        "                       coef    std err          t      P>|t|      [95.0% Conf. Int.]\n",
        "------------------------------------------------------------------------------------\n",
        "Intercept            0.5730      0.366      1.565      0.123        -0.160     1.306\n",
        "trt[T.progabide]    -0.1519      0.174     -0.875      0.385        -0.499     0.196\n",
        "age                  0.0223      0.012      1.931      0.058        -0.001     0.046\n",
        "base                 0.0226      0.001     18.177      0.000         0.020     0.025\n",
        "====================================================================================\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Comparison"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The following presents tables to compare our estimates and the resulting inference across our previously estimated models.\n",
      "\n",
      "The models corresponding to the columns in the data are:\n",
      "'GLM non-robust', 'GEE', 'GLM robust', 'Poisson robust', 'GLM robust t', 'Poisson robust t'.\n",
      "\n",
      "First, all models estimate exactly the same values for the mean parameters. All models that we consider here assume the same underlying model and get the same parameters, even though the used estimation and optimization method differs. \n",
      "\n",
      "In terms of inference we get essentially 3 groups of results. \n",
      "\n",
      "The first model is the Poisson model estimated with GLM that assumes the data is generated from the Poisson distribution. The standard errors are relatively small and we reject the null hypothesis that a parameter is zero for all parameters based on the p-values.\n",
      "\n",
      "We can see that in the second group, GEE, Poisson and GLM with the Poisson family, each with cluster robust standard errors, the standard errors are much larger than the non-robust standard errors and based on the pvalues only base is significant and age is at the 0.05 threshold. The p-values and the confidence intervals are based on the asymptotic normal distribution of the estimated parameters. \n",
      "\n",
      "The last two columns show the cluster robust results for GLM-Poisson and Poisson if we use the t-distribution instead of the normal distribution. In the linear case the t distribution often provides a better small sample approximation than the normal distribution.\n",
      "Since we are using the Poisson distribution in this case, we would have to check, for example by Monte Carlo, whether the t-distribution is better than the normal distribution. Here we just use it to illustrate the difference between the two distributional assumptions. Standard errors are independent of the distribution and didn't change. However, the p-values are slightly large than under the normal distribution, and the confidence intervals are wider as can be seen in the summary tables of the models above.\n",
      "\n",
      "The standard errors for the parameters differ between GEE on one side and GLM-Poisson and Poisson on the other side by a small scaling factor, that reflects a different small sample correction or variance scaling. GLM-Poisson and Poisson share the same code for the robust standard errors and produce the same results.\n",
      "\n",
      "The default degrees of freedom, as well as the small sample correction, in the case of cluster robust standard errors are based on the number of clusters and not on the total number of observations. If we had based the degrees of freedom on the number of observations than the results based on t distribution would be closer to those based on the normal distribution.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "res_all = [rslt_glm, rslt1, rslt2, rslt3, rslt2t, rslt3t]\n",
      "res_names = ['GLM non-robust', 'GEE', 'GLM robust', 'Poisson robust', 'GLM robust t', 'Poisson robust t']\n",
      "import pandas\n",
      "# pandas.DataFrame([getattr(res, 'params') for res in res_all], columns=res_names)  # broken with my versions\n",
      "# pandas.DataFrame([pandas.Series(getattr(res, 'params'), index=rslt2.params.index) for res in res_all], columns=res_names)\n",
      "# the second has only nans"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# detour through numpy\n",
      "print('Parameters')\n",
      "pandas.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'params')) for res in res_all]), 4), \n",
      "                 index=rslt2.params.index, columns=res_names)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Parameters\n"
       ]
      },
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>GLM non-robust</th>\n",
        "      <th>GEE</th>\n",
        "      <th>GLM robust</th>\n",
        "      <th>Poisson robust</th>\n",
        "      <th>GLM robust t</th>\n",
        "      <th>Poisson robust t</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>Intercept</th>\n",
        "      <td> 0.5730</td>\n",
        "      <td> 0.5730</td>\n",
        "      <td> 0.5730</td>\n",
        "      <td> 0.5730</td>\n",
        "      <td> 0.5730</td>\n",
        "      <td> 0.5730</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>trt[T.progabide]</th>\n",
        "      <td>-0.1519</td>\n",
        "      <td>-0.1519</td>\n",
        "      <td>-0.1519</td>\n",
        "      <td>-0.1519</td>\n",
        "      <td>-0.1519</td>\n",
        "      <td>-0.1519</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>age</th>\n",
        "      <td> 0.0223</td>\n",
        "      <td> 0.0223</td>\n",
        "      <td> 0.0223</td>\n",
        "      <td> 0.0223</td>\n",
        "      <td> 0.0223</td>\n",
        "      <td> 0.0223</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>base</th>\n",
        "      <td> 0.0226</td>\n",
        "      <td> 0.0226</td>\n",
        "      <td> 0.0226</td>\n",
        "      <td> 0.0226</td>\n",
        "      <td> 0.0226</td>\n",
        "      <td> 0.0226</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "                  GLM non-robust     GEE  GLM robust  Poisson robust  \\\n",
        "Intercept                 0.5730  0.5730      0.5730          0.5730   \n",
        "trt[T.progabide]         -0.1519 -0.1519     -0.1519         -0.1519   \n",
        "age                       0.0223  0.0223      0.0223          0.0223   \n",
        "base                      0.0226  0.0226      0.0226          0.0226   \n",
        "\n",
        "                  GLM robust t  Poisson robust t  \n",
        "Intercept               0.5730            0.5730  \n",
        "trt[T.progabide]       -0.1519           -0.1519  \n",
        "age                     0.0223            0.0223  \n",
        "base                    0.0226            0.0226  "
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print('Standard Errors')\n",
      "pandas.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'bse')) for res in res_all]), 4), \n",
      "                 index=rslt2.params.index, columns=res_names)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Standard Errors\n"
       ]
      },
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>GLM non-robust</th>\n",
        "      <th>GEE</th>\n",
        "      <th>GLM robust</th>\n",
        "      <th>Poisson robust</th>\n",
        "      <th>GLM robust t</th>\n",
        "      <th>Poisson robust t</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>Intercept</th>\n",
        "      <td> 0.1357</td>\n",
        "      <td> 0.3607</td>\n",
        "      <td> 0.3662</td>\n",
        "      <td> 0.3662</td>\n",
        "      <td> 0.3662</td>\n",
        "      <td> 0.3662</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>trt[T.progabide]</th>\n",
        "      <td> 0.0478</td>\n",
        "      <td> 0.1711</td>\n",
        "      <td> 0.1736</td>\n",
        "      <td> 0.1736</td>\n",
        "      <td> 0.1736</td>\n",
        "      <td> 0.1736</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>age</th>\n",
        "      <td> 0.0040</td>\n",
        "      <td> 0.0114</td>\n",
        "      <td> 0.0116</td>\n",
        "      <td> 0.0116</td>\n",
        "      <td> 0.0116</td>\n",
        "      <td> 0.0116</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>base</th>\n",
        "      <td> 0.0005</td>\n",
        "      <td> 0.0012</td>\n",
        "      <td> 0.0012</td>\n",
        "      <td> 0.0012</td>\n",
        "      <td> 0.0012</td>\n",
        "      <td> 0.0012</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "                  GLM non-robust     GEE  GLM robust  Poisson robust  \\\n",
        "Intercept                 0.1357  0.3607      0.3662          0.3662   \n",
        "trt[T.progabide]          0.0478  0.1711      0.1736          0.1736   \n",
        "age                       0.0040  0.0114      0.0116          0.0116   \n",
        "base                      0.0005  0.0012      0.0012          0.0012   \n",
        "\n",
        "                  GLM robust t  Poisson robust t  \n",
        "Intercept               0.3662            0.3662  \n",
        "trt[T.progabide]        0.1736            0.1736  \n",
        "age                     0.0116            0.0116  \n",
        "base                    0.0012            0.0012  "
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print('P-values')\n",
      "pandas.DataFrame(np.round(np.column_stack([np.asarray(getattr(res, 'pvalues')) for res in res_all]), 4), \n",
      "                 index=rslt2.params.index, columns=res_names)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "P-values\n"
       ]
      },
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>GLM non-robust</th>\n",
        "      <th>GEE</th>\n",
        "      <th>GLM robust</th>\n",
        "      <th>Poisson robust</th>\n",
        "      <th>GLM robust t</th>\n",
        "      <th>Poisson robust t</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>Intercept</th>\n",
        "      <td> 0.0000</td>\n",
        "      <td> 0.1122</td>\n",
        "      <td> 0.1176</td>\n",
        "      <td> 0.1176</td>\n",
        "      <td> 0.1230</td>\n",
        "      <td> 0.1230</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>trt[T.progabide]</th>\n",
        "      <td> 0.0015</td>\n",
        "      <td> 0.3746</td>\n",
        "      <td> 0.3817</td>\n",
        "      <td> 0.3817</td>\n",
        "      <td> 0.3853</td>\n",
        "      <td> 0.3853</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>age</th>\n",
        "      <td> 0.0000</td>\n",
        "      <td> 0.0500</td>\n",
        "      <td> 0.0535</td>\n",
        "      <td> 0.0535</td>\n",
        "      <td> 0.0584</td>\n",
        "      <td> 0.0584</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>base</th>\n",
        "      <td> 0.0000</td>\n",
        "      <td> 0.0000</td>\n",
        "      <td> 0.0000</td>\n",
        "      <td> 0.0000</td>\n",
        "      <td> 0.0000</td>\n",
        "      <td> 0.0000</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 14,
       "text": [
        "                  GLM non-robust     GEE  GLM robust  Poisson robust  \\\n",
        "Intercept                 0.0000  0.1122      0.1176          0.1176   \n",
        "trt[T.progabide]          0.0015  0.3746      0.3817          0.3817   \n",
        "age                       0.0000  0.0500      0.0535          0.0535   \n",
        "base                      0.0000  0.0000      0.0000          0.0000   \n",
        "\n",
        "                  GLM robust t  Poisson robust t  \n",
        "Intercept               0.1230            0.1230  \n",
        "trt[T.progabide]        0.3853            0.3853  \n",
        "age                     0.0584            0.0584  \n",
        "base                    0.0000            0.0000  "
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Appendix: Convariance Types in GEE"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#covariance_type='robust'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# BUG see https://github.com/statsmodels/statsmodels/issues/1906\n",
      "mod1b = GEE.from_formula(\"y ~ age + trt + base\", \"subject\", data, cov_struct=ind, family=fam)\n",
      "rslt1b = mod1b.fit(covariance_type='naive')\n",
      "rslt1b_ = mod1b.fit(covariance_type='naive')\n",
      "print '\\n'\n",
      "print rslt1b.summary()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "\n",
        "                               GEE Regression Results                              \n",
        "===================================================================================\n",
        "Dep. Variable:                           y   No. Observations:                  236\n",
        "Model:                                 GEE   No. clusters:                       59\n",
        "Method:                        Generalized   Min. cluster size:                   4\n",
        "                      Estimating Equations   Max. cluster size:                   4\n",
        "Family:                            Poisson   Mean cluster size:                 4.0\n",
        "Dependence structure:         Independence   Num. iterations:                    51\n",
        "Date:                     Tue, 19 Aug 2014   Scale:                           5.087\n",
        "Covariance type:                    robust   Time:                         12:13:20\n",
        "====================================================================================\n",
        "                       coef    std err          z      P>|z|      [95.0% Conf. Int.]\n",
        "------------------------------------------------------------------------------------\n",
        "Intercept            0.5730      0.361      1.589      0.112        -0.134     1.280\n",
        "trt[T.progabide]    -0.1519      0.171     -0.888      0.375        -0.487     0.183\n",
        "age                  0.0223      0.011      1.960      0.050      2.11e-06     0.045\n",
        "base                 0.0226      0.001     18.451      0.000         0.020     0.025\n",
        "==============================================================================\n",
        "Skew:                          3.7823   Kurtosis:                      28.6672\n",
        "Centered skew:                 2.7597   Centered kurtosis:             21.9865\n",
        "==============================================================================\n"
       ]
      }
     ],
     "prompt_number": 30
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#print(rslt1b_.summary())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 31
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1b.covariance_type, rslt1b_.covariance_type"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 32,
       "text": [
        "('robust', 'naive')"
       ]
      }
     ],
     "prompt_number": 32
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1b.bse, rslt1b_.bse"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 39,
       "text": [
        "(array([ 0.36072614,  0.17105111,  0.01140096,  0.00122675]),\n",
        " array([ 0.30598606,  0.10789098,  0.00908399,  0.00114894]))"
       ]
      }
     ],
     "prompt_number": 39
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1b_.bse - rslt1.bse"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 34,
       "text": [
        "array([ -5.47400849e-02,  -6.31601337e-02,  -2.31696404e-03,\n",
        "        -7.78130388e-05])"
       ]
      }
     ],
     "prompt_number": 34
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "bse_naive = np.sqrt(np.diag(rslt1b.naive_covariance))\n",
      "print(bse_naive)\n",
      "print(rslt_glm.bse.values)\n",
      "#np.sqrt(np.diag(rslt1c.robust_covariance_bc))\n",
      "print(bse_naive / rslt_glm.bse.values)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[ 0.30598606  0.10789098  0.00908399  0.00114894]\n",
        "[ 0.1356608   0.04783413  0.00402744  0.00050939]\n",
        "[ 2.25552299  2.25552299  2.25552299  2.25552299]\n"
       ]
      }
     ],
     "prompt_number": 36
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.sqrt(rslt_glm.pearson_chi2 / rslt_glm.model.df_resid)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "mod1c = GEE.from_formula(\"y ~ age + trt + base\", \"subject\", data, cov_struct=ind, family=fam)\n",
      "rslt1c = mod1c.fit(covariance_type='bias-reduced')\n",
      "print '\\n'\n",
      "print rslt1c.summary()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1c.bse - rslt1.bse"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.sqrt(np.diag(rslt1c.robust_covariance_bc))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1c.bse\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1b.bse"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1b.covariance_type"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1b.covariance_type = 'naive'\n",
      "del rslt1b._cache['bse']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1b.bse"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1b_ = mod1b.fit(covariance_type='naive')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1b_.covariance_type"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rslt1b_.bse"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}