bwallace/more-irony

## more-irony
{
 "metadata": {
  "name": "irony-context"
 },
 "nbformat": 2,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "source": [
      "# Exploiting context for verbal irony classification #",
      "",
      "## Crude model of expectations ##",
      "",
      "Induce a probabalistic classifier (logistic regression) to discriminate posts in the *Conservative* vs *progressive* subreddits -- training data for this task is essentially free. See the code for this in the `pretense` method. ",
      "",
      "Below output is shown for two stratified regressions: one for the *Conservative* subreddit and one for the *progressive* subreddit. In each we're testing the correlation of the *predicted* probability that a comment was posted in the *progressive* subreddit and ironic intent. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cd '/Users/bwallace/dev/computational-irony/irony-redux'"
     ],
     "language": "python",
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "/Users/bwallace/dev/computational-irony/irony-redux"
       ]
      }
     ],
     "prompt_number": 29
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import annotation_stats"
     ],
     "language": "python",
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "reload(annotation_stats);"
     ],
     "language": "python",
     "outputs": [],
     "prompt_number": 24
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "annotation_stats.pretense()"
     ],
     "language": "python",
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Optimization terminated successfully.",
        "         Current function value: 0.518405",
        "         Iterations 6",
        " --- conservative ---",
        "                           Logit Regression Results                           ",
        "==============================================================================",
        "Dep. Variable:                      y   No. Observations:                  666",
        "Model:                          Logit   Df Residuals:                      664",
        "Method:                           MLE   Df Model:                            1",
        "Date:                Tue, 11 Mar 2014   Pseudo R-squ.:                0.003329",
        "Time:                        11:31:41   Log-Likelihood:                -345.26",
        "converged:                       True   LL-Null:                       -346.41",
        "                                        LLR p-value:                    0.1288",
        "==============================================================================",
        "                 coef    std err          z      P>|z|      [95.0% Conf. Int.]",
        "------------------------------------------------------------------------------",
        "const         -2.6908      0.952     -2.827      0.005        -4.557    -0.825",
        "x1             3.6371      2.460      1.478      0.139        -1.185     8.459",
        "==============================================================================",
        "Optimization terminated successfully."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "",
        "         Current function value: 0.553258",
        "         Iterations 5",
        "",
        " --- liberal ---",
        "                           Logit Regression Results                           ",
        "==============================================================================",
        "Dep. Variable:                      y   No. Observations:                  635",
        "Model:                          Logit   Df Residuals:                      633",
        "Method:                           MLE   Df Model:                            1",
        "Date:                Tue, 11 Mar 2014   Pseudo R-squ.:                0.004490",
        "Time:                        11:31:41   Log-Likelihood:                -351.32",
        "converged:                       True   LL-Null:                       -352.90",
        "                                        LLR p-value:                   0.07504",
        "==============================================================================",
        "                 coef    std err          z      P>|z|      [95.0% Conf. Int.]",
        "------------------------------------------------------------------------------",
        "const          1.2382      1.332      0.930      0.353        -1.372     3.848",
        "x1            -3.9574      2.224     -1.780      0.075        -8.316     0.401",
        "=============================================================================="
       ]
      }
     ],
     "prompt_number": 30
    },
    {
     "cell_type": "markdown",
     "source": [
      "The coefficients are in the expected direction, but not significant (at least for the *Conservative* subreddit), where the point estimate is 3.63 with CI (-1.185, 8.459). This says, though, that the more 'liberal' sounding a comment is in the conservative subreddit, the more likely it is to have been intended ironically. Results are similar for the *progressive* subreddit, and the correlation is even significant at the 0.1 level. In particular, the more liberal a comment in the liberal subreddit seems, the less likely it is to have been intended ironically.",
      "",
      "## Exploiting this ##",
      "",
      "Moving back to the ML side, we'd like to exploit this predictor. To this end I introduced 3 features (in addition to the standard BoW features). One feature indicates whether the post was in the *proressive* subreddit, while the other two are *interaction* terms that cross the subreddit with the predicted probability of a comment being from the *progressive* reddit."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "reload(annotation_stats);"
     ],
     "language": "python",
     "outputs": [],
     "prompt_number": 49
    },
    {
     "cell_type": "markdown",
     "source": [
      "### Without these features (baseline) ###"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "annotation_stats.pretense_experiment(use_pretense=False)"
     ],
     "language": "python",
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "building progressive/conservative model!",
        "",
        "recalls:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "",
        "[0.8275862068965517, 1.0, 0.6470588235294118, 1.0, 0.8333333333333334]",
        "average: 0.861595672752",
        "",
        "precisions:",
        "[0.26666666666666666, 0.248868778280543, 0.29464285714285715, 0.24545454545454545, 0.32894736842105265]",
        "average: 0.276916043193",
        "",
        "F1s:",
        "[0.4033613445378151, 0.39855072463768115, 0.40490797546012275, 0.39416058394160586, 0.4716981132075471]",
        "average: 0.414535748357"
       ]
      }
     ],
     "prompt_number": 27
    },
    {
     "cell_type": "markdown",
     "source": [
      "### With these features ###"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "annotation_stats.pretense_experiment(use_pretense=True)"
     ],
     "language": "python",
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "building progressive/conservative model!",
        "",
        "recalls:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "",
        "[0.5, 1.0, 0.4117647058823529, 1.0, 0.45]",
        "average: 0.672352941176",
        "",
        "precisions:",
        "[0.27102803738317754, 0.24663677130044842, 0.21649484536082475, 0.24545454545454545, 0.23275862068965517]",
        "average: 0.242474564038",
        "",
        "F1s:",
        "[0.3515151515151515, 0.39568345323741005, 0.28378378378378377, 0.39416058394160586, 0.3068181818181818]",
        "average: 0.346392230859"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/scipy/sparse/compressed.py:486: SparseEfficiencyWarning: changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.",
        "  SparseEfficiencyWarning)"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "markdown",
     "source": [
      "In other words, this unfortunately performs *worse*! However, the additional features *are* predictive (or seem to be)! Consider",
      "",
      "`print clf.coef_[0][liberal_j]`",
      "",
      "`1.15580604539`",
      "",
      "Which says that a relatively large weight is associated with 1-[prob liberal comment(i)] given that is in the liberal subreddit",
      "",
      "Uncertain why this performs worse, despite the feature ostensibly being informative!"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "# Regrouping #",
      "",
      "I think the strategy to take is two-fold:",
      "",
      "1. Move prediction to sentence-level.",
      "2. Evaluate by taking into consideration multiple annotations (papers on this?)",
      "2.a. Maybe just add multiple copies of each instance (one per annotator) with assocaited label?",
      "",
      ""
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "reload(annotation_stats);"
     ],
     "language": "python",
     "outputs": [],
     "prompt_number": 53
    },
    {
     "cell_type": "markdown",
     "source": [
      "First, I've moved to using the latest database (more labels). I've also implemented a baseline strategy that will guess randomly (with probability equal to the training proportion) that utterances are ironic. Let's see how that does.",
      "",
      "### baseline (random) ###"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "annotation_stats.pretense_experiment(model=\"baseline\", use_pretense=False)"
     ],
     "language": "python",
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "building progressive/conservative model!",
        "----------"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "",
        "baseline",
        "----------",
        "",
        "recalls:",
        "[0.20408163265306123, 0.26136363636363635, 0.22727272727272727, 0.21052631578947367, 0.24752475247524752]",
        "average: 0.230153812911",
        "",
        "precisions:",
        "[0.2222222222222222, 0.23232323232323232, 0.21052631578947367, 0.21978021978021978, 0.25773195876288657]",
        "average: 0.228516789776",
        "",
        "F1s:",
        "[0.2127659574468085, 0.2459893048128342, 0.21857923497267756, 0.21505376344086022, 0.25252525252525254]",
        "average: 0.22898270264",
        "",
        "AUCs:",
        "[0.2487828474292407, 0.24105211920592637, 0.23835961421097912, 0.2350927955831317, 0.29480698635592373]",
        "average: 0.251618872557",
        "",
        "average kappas:",
        "[-0.023785813439299915, 0.00630464591126308, 0.00948435849008785, -0.01311975524123987, 0.0027460928460390265]",
        "average average kappas: -0.00367409428663"
       ]
      }
     ],
     "prompt_number": 44
    },
    {
     "cell_type": "markdown",
     "source": [
      "So *baseline* gives an average F1 of .229 and an average pairwise kappa of -.004. Let's consider our baseline SVM approaches (no 'pretense'/context). First, we'll use SGD.",
      "",
      "### SGD ###"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "annotation_stats.pretense_experiment(model=\"SGD\")"
     ],
     "language": "python",
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "building progressive/conservative model!",
        "----------"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "",
        "SGD",
        "----------",
        "",
        "recalls:",
        "[0.32653061224489793, 0.6931818181818182, 0.8863636363636364, 0.49473684210526314, 0.9207920792079208]",
        "average: 0.664320997621",
        "",
        "precisions:",
        "[0.4, 0.27232142857142855, 0.25573770491803277, 0.34306569343065696, 0.28703703703703703]",
        "average: 0.311632372791",
        "",
        "F1s:",
        "[0.35955056179775274, 0.391025641025641, 0.3969465648854962, 0.4051724137931035, 0.43764705882352944]",
        "average: 0.398068448065",
        "",
        "AUCs:",
        "[0.35818002214454642, 0.31774591776953282, 0.32872691560169881, 0.35425753115781783, 0.34646571223040229]",
        "average: 0.341075219781",
        "",
        "average kappas:",
        "[0.1293372255888978, 0.0682644280882094, 0.03839949825357625, 0.06430445302607776, 0.026188986920469955]",
        "average average kappas: 0.0652989183754"
       ]
      }
     ],
     "prompt_number": 45
    },
    {
     "cell_type": "markdown",
     "source": [
      "So a clear improvement, at least, providing some sanity. ~.4 F1 and .065 kappa. ",
      "",
      "Sidenote on kappas, these are always going to be low when prevalence is low (as it is here), so these might not be as bad as they 'seem' (although admittedly, they're pretty bad: average pairwise human kappas are like .34 or something). It's clear that we're going to have think harder about metrics here. ",
      "",
      "Let's consider the linearSVC, which really should perform comparably.",
      "",
      "### LinearSVC ###"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "annotation_stats.pretense_experiment(model=\"SVC\")"
     ],
     "language": "python",
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "building progressive/conservative model!",
        "----------"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "",
        "SVC",
        "----------",
        "",
        "recalls:",
        "[0.2857142857142857, 0.3181818181818182, 0.375, 0.29473684210526313, 0.26732673267326734]",
        "average: 0.308191935735",
        "",
        "precisions:",
        "[0.3835616438356164, 0.35, 0.4024390243902439, 0.35443037974683544, 0.38571428571428573]",
        "average: 0.375229066737",
        "",
        "F1s:",
        "[0.32748538011695905, 0.3333333333333333, 0.3882352941176471, 0.32183908045977017, 0.31578947368421056]",
        "average: 0.337336512342",
        "",
        "AUCs:",
        "[0.36141385415903493, 0.31786163155807978, 0.32883497088741004, 0.35446682340522445, 0.34679745021447139]",
        "average: 0.341874946045",
        "",
        "average kappas:",
        "[0.1002457720814471, 0.09435598354585739, 0.15129435623787119, 0.039703416427287114, 0.10264635996727187]",
        "average average kappas: 0.0976491776519"
       ]
      }
     ],
     "prompt_number": 46
    },
    {
     "cell_type": "markdown",
     "source": [
      "So the first thing of note is that the F1s are notably different (0.34 here versus ~.4 above), which is slightly disconcerting still, although not as extreme as we saw before. SVC weirdly improves kappas, which is interesting. The other thing I would note that I find reassuring is that the **AUCs are practically identical** -- perhaps a lot depends simply on where you draw your threshold.",
      "",
      "Now, for some context.",
      "",
      "### PRETENSE ###",
      "",
      "First let's just add one feature: the predicted probability of an utterance having been intended ironically"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "annotation_stats.pretense_experiment(model=\"SGD\", use_pretense=True, interaction_features=False)"
     ],
     "language": "python",
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "building progressive/conservative model!",
        "----------"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "",
        "SGD",
        "----------",
        "",
        "recalls:",
        "[0.8775510204081632, 0.9318181818181818, 0.3522727272727273, 1.0, 0.7722772277227723]",
        "average: 0.786783831444",
        "",
        "precisions:",
        "[0.27388535031847133, 0.23631123919308358, 0.4025974025974026, 0.25885558583106266, 0.3132530120481928]",
        "average: 0.296980517998",
        "",
        "F1s:",
        "[0.4174757281553398, 0.37701149425287356, 0.3757575757575758, 0.41125541125541126, 0.4457142857142857]",
        "average: 0.405442899027",
        "",
        "AUCs:",
        "[0.33671307977713638, 0.29784249365180226, 0.33241860558960229, 0.27943621917281503, 0.34587712510096663]",
        "average: 0.318457504658",
        "",
        "average kappas:",
        "[0.011544960147504276, -0.00019105189665967297, 0.14907073654989755, 0.005039101166456329, 0.060664372033046206]",
        "average average kappas: 0.0452256236"
       ]
      }
     ],
     "prompt_number": 51
    },
    {
     "cell_type": "markdown",
     "source": [
      "A mixed bag. Massive gains in terms of F1 (?!) but at the expense of AUC (and kappa). It looks like we lost a bunch of precision."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "annotation_stats.pretense_experiment(model=\"SGD\", use_pretense=True, interaction_features=True)"
     ],
     "language": "python",
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "building progressive/conservative model!",
        "----------"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "",
        "SGD",
        "----------",
        "",
        "recalls:",
        "[0.40816326530612246, 0.5454545454545454, 0.5568181818181818, 0.45263157894736844, 0.6831683168316832]",
        "average: 0.529247177672",
        "",
        "precisions:",
        "[0.23809523809523808, 0.24742268041237114, 0.2677595628415301, 0.24157303370786518, 0.3165137614678899]",
        "average: 0.262272855305",
        "",
        "F1s:",
        "[0.30075187969924805, 0.3404255319148936, 0.3616236162361624, 0.315018315018315, 0.43260188087774293]",
        "average: 0.350084244749",
        "",
        "AUCs:",
        "[0.32969153263719136, 0.27821444144388791, 0.35271999025127515, 0.261655108701911, 0.34839131192164463]",
        "average: 0.314134476991",
        "",
        "average kappas:",
        "[-0.03521849489188062, 0.01840374284147445, 0.018843041275619674, -0.019045570083904603, 0.04343870334834959]",
        "average average kappas: 0.00528428449793"
       ]
      }
     ],
     "prompt_number": 54
    },
    {
     "cell_type": "markdown",
     "source": [
      "Almost suspiciously bad -- I wonder if I haven't futzed something up with my creation of the interaction features. ",
      "",
      "But I don't think so, in part because these terms seem really, really predictive, e.g.,:",
      "",
      "print clf.best_estimator_.coef_[0,liberal_j]",
      "",
      "0.263161208169",
      "",
      "print clf.best_estimator_.coef_[0,liberal_j-1]",
      "",
      "0.882066517978",
      "",
      "print clf.best_estimator_.coef_[0,conservative_j]",
      "",
      "-0.649088581092",
      "",
      "So the *best* predictor (biggest weight in learned vector) is the `liberal intercept', which says that the associated comment was in the liberal subreddit; this is a positive weight, suggesting liberals are more likely to be ironic. This makes sense, I guess. **But**, confusingly, weights corresponding to the predicted probability that an utterance belongs to the liberal subreddit but is in the conservative subreddit is *negative*?!?!",
      "",
      "Sanity check on most predictive features from progressive/conservative classifier (1 being progressive):",
      "        ",
      "\t-2.8326\tobama          \t\t1.5667\tprogressive    ",
      "\t-1.2036\tobamacare      \t\t1.2391\tgop            ",
      "\t-1.0678\tdemocrats      \t\t1.2323\trepublican     ",
      "\t-1.0602\tliberal        \t\t1.0716\tcorporate      ",
      "\t-1.0333\tracist         \t\t1.0404\tstate          ",
      "\t-1.0328\tadministration \t\t1.0312\tpaul   ",
      "",
      "So this seems OK. I really don't know what's going on. My conclusion so far is that, weirdly, the models seem to find that these features are predictive, but then it seems to *hurt* predictive performance. Even more concerning, the weights are in unexpected directions! One possibility here is that we need to cross the perceived likelihood of a comment coming from the conservative/progressive subreddit with the perceived sentiment, e.g., using the NNP/sentiment model we discussed.",
      "",
      "I am feeling better about disparities between SVC/SGD, as these differences now seem rather negligible. I'm not sure how to square this confusion with the regression result presented at the outset of this notebook: perhaps the effect is only due to the subreddit (liberals tend to be ironic more)? One idea would be to introduce interaction features crossing words with the subreddit (this would get at whether, e.g., conservatives tend to use certain tokens ironically). Open to other ideas, of course :). Let's move on to *sentence-level* prediction, which may be a better task."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [],
     "language": "python",
     "outputs": []
    }
   ]
  }
 ]
}
	{
	"metadata": {
	"name": "irony-context"
	},
	"nbformat": 2,
	"worksheets": [
	{
	"cells": [
	{
	"cell_type": "markdown",
	"source": [
	"# Exploiting context for verbal irony classification #",
	"",
	"## Crude model of expectations ##",
	"",
	"Induce a probabalistic classifier (logistic regression) to discriminate posts in the Conservative vs progressive subreddits -- training data for this task is essentially free. See the code for this in the `pretense` method. ",
	"",
	"Below output is shown for two stratified regressions: one for the Conservative subreddit and one for the progressive subreddit. In each we're testing the correlation of the predicted probability that a comment was posted in the progressive subreddit and ironic intent. "
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"cd '/Users/bwallace/dev/computational-irony/irony-redux'"
	],
	"language": "python",
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"/Users/bwallace/dev/computational-irony/irony-redux"
	]
	}
	],
	"prompt_number": 29
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"import annotation_stats"
	],
	"language": "python",
	"outputs": [],
	"prompt_number": 10
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"reload(annotation_stats);"
	],
	"language": "python",
	"outputs": [],
	"prompt_number": 24
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"annotation_stats.pretense()"
	],
	"language": "python",
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"Optimization terminated successfully.",
	" Current function value: 0.518405",
	" Iterations 6",
	" --- conservative ---",
	" Logit Regression Results ",
	"==============================================================================",
	"Dep. Variable: y No. Observations: 666",
	"Model: Logit Df Residuals: 664",
	"Method: MLE Df Model: 1",
	"Date: Tue, 11 Mar 2014 Pseudo R-squ.: 0.003329",
	"Time: 11:31:41 Log-Likelihood: -345.26",
	"converged: True LL-Null: -346.41",
	" LLR p-value: 0.1288",
	"==============================================================================",
	" coef std err z P>\|z\| [95.0% Conf. Int.]",
	"------------------------------------------------------------------------------",
	"const -2.6908 0.952 -2.827 0.005 -4.557 -0.825",
	"x1 3.6371 2.460 1.478 0.139 -1.185 8.459",
	"==============================================================================",
	"Optimization terminated successfully."
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"",
	" Current function value: 0.553258",
	" Iterations 5",
	"",
	" --- liberal ---",
	" Logit Regression Results ",
	"==============================================================================",
	"Dep. Variable: y No. Observations: 635",
	"Model: Logit Df Residuals: 633",
	"Method: MLE Df Model: 1",
	"Date: Tue, 11 Mar 2014 Pseudo R-squ.: 0.004490",
	"Time: 11:31:41 Log-Likelihood: -351.32",
	"converged: True LL-Null: -352.90",
	" LLR p-value: 0.07504",
	"==============================================================================",
	" coef std err z P>\|z\| [95.0% Conf. Int.]",
	"------------------------------------------------------------------------------",
	"const 1.2382 1.332 0.930 0.353 -1.372 3.848",
	"x1 -3.9574 2.224 -1.780 0.075 -8.316 0.401",
	"=============================================================================="
	]
	}
	],
	"prompt_number": 30
	},
	{
	"cell_type": "markdown",
	"source": [
	"The coefficients are in the expected direction, but not significant (at least for the Conservative subreddit), where the point estimate is 3.63 with CI (-1.185, 8.459). This says, though, that the more 'liberal' sounding a comment is in the conservative subreddit, the more likely it is to have been intended ironically. Results are similar for the progressive subreddit, and the correlation is even significant at the 0.1 level. In particular, the more liberal a comment in the liberal subreddit seems, the less likely it is to have been intended ironically.",
	"",
	"## Exploiting this ##",
	"",
	"Moving back to the ML side, we'd like to exploit this predictor. To this end I introduced 3 features (in addition to the standard BoW features). One feature indicates whether the post was in the proressive subreddit, while the other two are interaction terms that cross the subreddit with the predicted probability of a comment being from the progressive reddit."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"reload(annotation_stats);"
	],
	"language": "python",
	"outputs": [],
	"prompt_number": 49
	},
	{
	"cell_type": "markdown",
	"source": [
	"### Without these features (baseline) ###"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"annotation_stats.pretense_experiment(use_pretense=False)"
	],
	"language": "python",
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"building progressive/conservative model!",
	"",
	"recalls:"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"",
	"[0.8275862068965517, 1.0, 0.6470588235294118, 1.0, 0.8333333333333334]",
	"average: 0.861595672752",
	"",
	"precisions:",
	"[0.26666666666666666, 0.248868778280543, 0.29464285714285715, 0.24545454545454545, 0.32894736842105265]",
	"average: 0.276916043193",
	"",
	"F1s:",
	"[0.4033613445378151, 0.39855072463768115, 0.40490797546012275, 0.39416058394160586, 0.4716981132075471]",
	"average: 0.414535748357"
	]
	}
	],
	"prompt_number": 27
	},
	{
	"cell_type": "markdown",
	"source": [
	"### With these features ###"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"annotation_stats.pretense_experiment(use_pretense=True)"
	],
	"language": "python",
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"building progressive/conservative model!",
	"",
	"recalls:"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"",
	"[0.5, 1.0, 0.4117647058823529, 1.0, 0.45]",
	"average: 0.672352941176",
	"",
	"precisions:",
	"[0.27102803738317754, 0.24663677130044842, 0.21649484536082475, 0.24545454545454545, 0.23275862068965517]",
	"average: 0.242474564038",
	"",
	"F1s:",
	"[0.3515151515151515, 0.39568345323741005, 0.28378378378378377, 0.39416058394160586, 0.3068181818181818]",
	"average: 0.346392230859"
	]
	},
	{
	"output_type": "stream",
	"stream": "stderr",
	"text": [
	"/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/scipy/sparse/compressed.py:486: SparseEfficiencyWarning: changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.",
	" SparseEfficiencyWarning)"
	]
	}
	],
	"prompt_number": 28
	},
	{
	"cell_type": "markdown",
	"source": [
	"In other words, this unfortunately performs worse! However, the additional features are predictive (or seem to be)! Consider",
	"",
	"`print clf.coef_[0][liberal_j]`",
	"",
	"`1.15580604539`",
	"",
	"Which says that a relatively large weight is associated with 1-[prob liberal comment(i)] given that is in the liberal subreddit",
	"",
	"Uncertain why this performs worse, despite the feature ostensibly being informative!"
	]
	},
	{
	"cell_type": "markdown",
	"source": [
	"# Regrouping #",
	"",
	"I think the strategy to take is two-fold:",
	"",
	"1. Move prediction to sentence-level.",
	"2. Evaluate by taking into consideration multiple annotations (papers on this?)",
	"2.a. Maybe just add multiple copies of each instance (one per annotator) with assocaited label?",
	"",
	""
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"reload(annotation_stats);"
	],
	"language": "python",
	"outputs": [],
	"prompt_number": 53
	},
	{
	"cell_type": "markdown",
	"source": [
	"First, I've moved to using the latest database (more labels). I've also implemented a baseline strategy that will guess randomly (with probability equal to the training proportion) that utterances are ironic. Let's see how that does.",
	"",
	"### baseline (random) ###"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"annotation_stats.pretense_experiment(model=\"baseline\", use_pretense=False)"
	],
	"language": "python",
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"building progressive/conservative model!",
	"----------"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"",
	"baseline",
	"----------",
	"",
	"recalls:",
	"[0.20408163265306123, 0.26136363636363635, 0.22727272727272727, 0.21052631578947367, 0.24752475247524752]",
	"average: 0.230153812911",
	"",
	"precisions:",
	"[0.2222222222222222, 0.23232323232323232, 0.21052631578947367, 0.21978021978021978, 0.25773195876288657]",
	"average: 0.228516789776",
	"",
	"F1s:",
	"[0.2127659574468085, 0.2459893048128342, 0.21857923497267756, 0.21505376344086022, 0.25252525252525254]",
	"average: 0.22898270264",
	"",
	"AUCs:",
	"[0.2487828474292407, 0.24105211920592637, 0.23835961421097912, 0.2350927955831317, 0.29480698635592373]",
	"average: 0.251618872557",
	"",
	"average kappas:",
	"[-0.023785813439299915, 0.00630464591126308, 0.00948435849008785, -0.01311975524123987, 0.0027460928460390265]",
	"average average kappas: -0.00367409428663"
	]
	}
	],
	"prompt_number": 44
	},
	{
	"cell_type": "markdown",
	"source": [
	"So baseline gives an average F1 of .229 and an average pairwise kappa of -.004. Let's consider our baseline SVM approaches (no 'pretense'/context). First, we'll use SGD.",
	"",
	"### SGD ###"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"annotation_stats.pretense_experiment(model=\"SGD\")"
	],
	"language": "python",
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"building progressive/conservative model!",
	"----------"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"",
	"SGD",
	"----------",
	"",
	"recalls:",
	"[0.32653061224489793, 0.6931818181818182, 0.8863636363636364, 0.49473684210526314, 0.9207920792079208]",
	"average: 0.664320997621",
	"",
	"precisions:",
	"[0.4, 0.27232142857142855, 0.25573770491803277, 0.34306569343065696, 0.28703703703703703]",
	"average: 0.311632372791",
	"",
	"F1s:",
	"[0.35955056179775274, 0.391025641025641, 0.3969465648854962, 0.4051724137931035, 0.43764705882352944]",
	"average: 0.398068448065",
	"",
	"AUCs:",
	"[0.35818002214454642, 0.31774591776953282, 0.32872691560169881, 0.35425753115781783, 0.34646571223040229]",
	"average: 0.341075219781",
	"",
	"average kappas:",
	"[0.1293372255888978, 0.0682644280882094, 0.03839949825357625, 0.06430445302607776, 0.026188986920469955]",
	"average average kappas: 0.0652989183754"
	]
	}
	],
	"prompt_number": 45
	},
	{
	"cell_type": "markdown",
	"source": [
	"So a clear improvement, at least, providing some sanity. ~.4 F1 and .065 kappa. ",
	"",
	"Sidenote on kappas, these are always going to be low when prevalence is low (as it is here), so these might not be as bad as they 'seem' (although admittedly, they're pretty bad: average pairwise human kappas are like .34 or something). It's clear that we're going to have think harder about metrics here. ",
	"",
	"Let's consider the linearSVC, which really should perform comparably.",
	"",
	"### LinearSVC ###"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"annotation_stats.pretense_experiment(model=\"SVC\")"
	],
	"language": "python",
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"building progressive/conservative model!",
	"----------"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"",
	"SVC",
	"----------",
	"",
	"recalls:",
	"[0.2857142857142857, 0.3181818181818182, 0.375, 0.29473684210526313, 0.26732673267326734]",
	"average: 0.308191935735",
	"",
	"precisions:",
	"[0.3835616438356164, 0.35, 0.4024390243902439, 0.35443037974683544, 0.38571428571428573]",
	"average: 0.375229066737",
	"",
	"F1s:",
	"[0.32748538011695905, 0.3333333333333333, 0.3882352941176471, 0.32183908045977017, 0.31578947368421056]",
	"average: 0.337336512342",
	"",
	"AUCs:",
	"[0.36141385415903493, 0.31786163155807978, 0.32883497088741004, 0.35446682340522445, 0.34679745021447139]",
	"average: 0.341874946045",
	"",
	"average kappas:",
	"[0.1002457720814471, 0.09435598354585739, 0.15129435623787119, 0.039703416427287114, 0.10264635996727187]",
	"average average kappas: 0.0976491776519"
	]
	}
	],
	"prompt_number": 46
	},
	{
	"cell_type": "markdown",
	"source": [
	"So the first thing of note is that the F1s are notably different (0.34 here versus ~.4 above), which is slightly disconcerting still, although not as extreme as we saw before. SVC weirdly improves kappas, which is interesting. The other thing I would note that I find reassuring is that the AUCs are practically identical -- perhaps a lot depends simply on where you draw your threshold.",
	"",
	"Now, for some context.",
	"",
	"### PRETENSE ###",
	"",
	"First let's just add one feature: the predicted probability of an utterance having been intended ironically"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"annotation_stats.pretense_experiment(model=\"SGD\", use_pretense=True, interaction_features=False)"
	],
	"language": "python",
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"building progressive/conservative model!",
	"----------"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"",
	"SGD",
	"----------",
	"",
	"recalls:",
	"[0.8775510204081632, 0.9318181818181818, 0.3522727272727273, 1.0, 0.7722772277227723]",
	"average: 0.786783831444",
	"",
	"precisions:",
	"[0.27388535031847133, 0.23631123919308358, 0.4025974025974026, 0.25885558583106266, 0.3132530120481928]",
	"average: 0.296980517998",
	"",
	"F1s:",
	"[0.4174757281553398, 0.37701149425287356, 0.3757575757575758, 0.41125541125541126, 0.4457142857142857]",
	"average: 0.405442899027",
	"",
	"AUCs:",
	"[0.33671307977713638, 0.29784249365180226, 0.33241860558960229, 0.27943621917281503, 0.34587712510096663]",
	"average: 0.318457504658",
	"",
	"average kappas:",
	"[0.011544960147504276, -0.00019105189665967297, 0.14907073654989755, 0.005039101166456329, 0.060664372033046206]",
	"average average kappas: 0.0452256236"
	]
	}
	],
	"prompt_number": 51
	},
	{
	"cell_type": "markdown",
	"source": [
	"A mixed bag. Massive gains in terms of F1 (?!) but at the expense of AUC (and kappa). It looks like we lost a bunch of precision."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"annotation_stats.pretense_experiment(model=\"SGD\", use_pretense=True, interaction_features=True)"
	],
	"language": "python",
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"building progressive/conservative model!",
	"----------"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"",
	"SGD",
	"----------",
	"",
	"recalls:",
	"[0.40816326530612246, 0.5454545454545454, 0.5568181818181818, 0.45263157894736844, 0.6831683168316832]",
	"average: 0.529247177672",
	"",
	"precisions:",
	"[0.23809523809523808, 0.24742268041237114, 0.2677595628415301, 0.24157303370786518, 0.3165137614678899]",
	"average: 0.262272855305",
	"",
	"F1s:",
	"[0.30075187969924805, 0.3404255319148936, 0.3616236162361624, 0.315018315018315, 0.43260188087774293]",
	"average: 0.350084244749",
	"",
	"AUCs:",
	"[0.32969153263719136, 0.27821444144388791, 0.35271999025127515, 0.261655108701911, 0.34839131192164463]",
	"average: 0.314134476991",
	"",
	"average kappas:",
	"[-0.03521849489188062, 0.01840374284147445, 0.018843041275619674, -0.019045570083904603, 0.04343870334834959]",
	"average average kappas: 0.00528428449793"
	]
	}
	],
	"prompt_number": 54
	},
	{
	"cell_type": "markdown",
	"source": [
	"Almost suspiciously bad -- I wonder if I haven't futzed something up with my creation of the interaction features. ",
	"",
	"But I don't think so, in part because these terms seem really, really predictive, e.g.,:",
	"",
	"print clf.best_estimator_.coef_[0,liberal_j]",
	"",
	"0.263161208169",
	"",
	"print clf.best_estimator_.coef_[0,liberal_j-1]",
	"",
	"0.882066517978",
	"",
	"print clf.best_estimator_.coef_[0,conservative_j]",
	"",
	"-0.649088581092",
	"",
	"So the best predictor (biggest weight in learned vector) is the `liberal intercept', which says that the associated comment was in the liberal subreddit; this is a positive weight, suggesting liberals are more likely to be ironic. This makes sense, I guess. But, confusingly, weights corresponding to the predicted probability that an utterance belongs to the liberal subreddit but is in the conservative subreddit is negative?!?!",
	"",
	"Sanity check on most predictive features from progressive/conservative classifier (1 being progressive):",
	" ",
	"\t-2.8326\tobama \t\t1.5667\tprogressive ",
	"\t-1.2036\tobamacare \t\t1.2391\tgop ",
	"\t-1.0678\tdemocrats \t\t1.2323\trepublican ",
	"\t-1.0602\tliberal \t\t1.0716\tcorporate ",
	"\t-1.0333\tracist \t\t1.0404\tstate ",
	"\t-1.0328\tadministration \t\t1.0312\tpaul ",
	"",
	"So this seems OK. I really don't know what's going on. My conclusion so far is that, weirdly, the models seem to find that these features are predictive, but then it seems to hurt predictive performance. Even more concerning, the weights are in unexpected directions! One possibility here is that we need to cross the perceived likelihood of a comment coming from the conservative/progressive subreddit with the perceived sentiment, e.g., using the NNP/sentiment model we discussed.",
	"",
	"I am feeling better about disparities between SVC/SGD, as these differences now seem rather negligible. I'm not sure how to square this confusion with the regression result presented at the outset of this notebook: perhaps the effect is only due to the subreddit (liberals tend to be ironic more)? One idea would be to introduce interaction features crossing words with the subreddit (this would get at whether, e.g., conservatives tend to use certain tokens ironically). Open to other ideas, of course :). Let's move on to sentence-level prediction, which may be a better task."
	]
	},
	{
	"cell_type": "code",
	"collapsed": true,
	"input": [],
	"language": "python",
	"outputs": []
	}
	]
	}
	]
	}