===
[Update July 25... and after https://gist.github.com/leondz/6082658 ]
OK, never mind the questions about cross-validation versus a smaller eval split and all that.
We trained and evaluated our tagger (current release, version 0.3.2)
on the same splits as the GATE tagger
(from http://gate.ac.uk/wiki/twitter-postagger.html, specifically twitie-tagger.zip),
and it gets 90.4% accuracy (significantly different from the GATE results).
Files and details: http://www.ark.cs.cmu.edu/TweetNLP/gate_ritterptb_comparison.zip
===
[The below was from July 24]
Analyzing the accuracy results in Derczynski et al. RANLP-2013
http://derczynski.com/sheffield/papers/twitter_pos.pdf
with regard to claims made in, e.g.:
https://twitter.com/LeonDerczynski/status/357442842035109889
https://twitter.com/LeonDerczynski/status/359951338714570752
All comparisons are on the Ritter et al. EMNLP-2011 dataset.
It's 15185 tokens total.
They report 88.7% accuracy on 15% of the Ritter data (which should be about
2278 tokens) as held-out evaluation data. (Table 8)
We (Owoputi et al. NAACL-2013) used cross-validation on this dataset (following
the protocol in the original Ritter 2011 paper) and did no development or
tuning at all. (We did all of that on a totally different dataset, because we
don't like the linguistic choices in the Ritter dataset, as described in our
paper.) We got 90.0% accuracy. (Table 4)
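For concreteness, here is a toy R sketch (not our actual evaluation code) of the
pooled cross-validation bookkeeping; the random tags just stand in for a real
tagger and real annotations, so only the accounting is meaningful:
  set.seed(0)
  n = 15185
  gold = sample(c("N","V","D"), n, replace=TRUE)   # fake gold tags, illustration only
  pred = ifelse(runif(n) < .9, gold, "N")          # fake tagger output, right about 90% of the time
  fold = sample(rep(1:10, length.out=n))           # toy 10-fold assignment (real folds split by tweet)
  tapply(pred == gold, fold, mean)                 # per-fold accuracies
  sum(pred == gold) / n                            # pooled accuracy over all 15185 tokens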
These sizes are all small so statistical power isn't good. However, they are
definitely comparable since they're samples from the same dataset, and no one
tuned on these test sets during development. Assuming the splits were randomly
chosen, we can compare the true token accuracy rates with an unpaired t-test.
(I'm treating the cross-validation "test set" as the aggregation of all the
splits, size 15185.) There's a clear difference
(p<0.04):
$ R
> owoputi = {p=.9;    n=15185;     c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}  # 0/1 correctness vector: 90.0% of all 15185 tokens
> ranlp   = {p=.8869; n=15185*.15; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}  # 0/1 vector for the ~2278-token held-out split (88.69%)
> t.test(owoputi,ranlp, alt='greater')
Welch Two Sample t-test
data: owoputi and ranlp
t = 1.825, df = 2922.547, p-value = 0.03405
Note, as far as 95% intervals go, their accuracy is 88.7 +/- 1.3, while ours is
90.0 +/- 0.5. (The intervals overlap, but running the t-test has better power
than a naive overlap test. A simulation gives the same answer too.)
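For concreteness, here is a sketch (not part of the original session) of the
normal-approximation intervals and one way such a simulation check could go; the
ci() helper and the rounded counts are just for illustration:
  ci = function(p, n) 1.96 * sqrt(p * (1 - p) / n)   # 95% half-width of a binomial proportion
  ci(.887, round(15185 * .15))   # ~0.013  ->  88.7 +/- 1.3
  ci(.900, 15185)                # ~0.005  ->  90.0 +/- 0.5
  # Parametric simulation at the observed rates: how often does the smaller
  # eval set's accuracy come out at least as high as the cross-validated one?
  set.seed(1)
  n1 = 15185; n2 = round(15185 * .15)
  diffs = rbinom(10000, n1, .900)/n1 - rbinom(10000, n2, .887)/n2
  mean(diffs <= 0)               # comes out around 0.03, in line with the t-test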
The deeper issue is that we need to annotate new data to really address the
problem of PTB-style POS tagging for Twitter. The Ritter dataset is a nice
start but is quite small; since it is so small, we and the original Ritter
paper did cross-validation to evaluate accuracy, but it's starting to get
dev/tuning-tainted for future development. Splitting an evaluation set off
from within the Ritter dataset is also unsatisfactory since it's so small --
Derczynski's 15% size gives confidence intervals of +/- 1.3%, which are awfully
wide! There are also various other issues: the dataset was created by a
single annotator with no agreement analysis, it comes with no tagging guidelines
or information about how social media-specific linguistic issues were handled,
and there are still open questions about the best way to adapt PTB conventions
to Twitter (e.g. the compounds issue).
References
Owoputi et al. NAACL 2013: http://www.ark.cs.cmu.edu/TweetNLP/owoputi+etal.naacl13.pdf
Ritter et al. EMNLP 2011: http://turing.cs.washington.edu/papers/ritter-emnlp2011-twitter_ner.pdf
Derczynski et al. RANLP 2013: http://derczynski.com/sheffield/papers/twitter_pos.pdf