Last active: December 20, 2015 04:29
===
[Update July 25... and after https://gist.github.com/leondz/6082658 ]
OK, never mind the questions about cross-validation versus a smaller eval split and all that.
We evaluated our tagger (current release, version 0.3.2),
trained and evaluated on the same splits as the GATE tagger
(from http://gate.ac.uk/wiki/twitter-postagger.html and specifically twitie-tagger.zip),
and it gets 90.4% accuracy (significantly different from the GATE results).
Files and details: http://www.ark.cs.cmu.edu/TweetNLP/gate_ritterptb_comparison.zip
===
[The below was from July 24]
Analyzing the accuracy results in Derczynski et al. RANLP-2013
http://derczynski.com/sheffield/papers/twitter_pos.pdf
with regard to claims in, e.g.,
https://twitter.com/LeonDerczynski/status/357442842035109889
https://twitter.com/LeonDerczynski/status/359951338714570752
All comparisons are on the Ritter et al. EMNLP-2011 dataset.
It's 15185 tokens total.
They report 88.7% accuracy on 15% of the Ritter data (which should be about
2278 tokens) used as held-out evaluation data. (Table 8)
We (Owoputi et al. NAACL-2013) used cross-validation on this dataset (following
the protocol in the original Ritter 2011 paper) and did no development or
tuning at all. (We did all of that on a totally different dataset, because we
don't like the linguistic choices in the Ritter dataset, as described in our
paper.) We got 90.0% accuracy. (Table 4)
These sizes are all small, so statistical power isn't good. However, they are
definitely comparable, since they're samples from the same dataset, and no one
tuned on these test sets during development. Assuming the splits were randomly
chosen, we can compare the true token accuracy rates with an unpaired t-test.
(I'm treating the cross-validation "test set" as the aggregation of all the
splits, size 15185.) There's a clear difference (p<0.04):
$ R
> owoputi={p=.9;n=15185; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> ranlp={p=.8869;n=15185*.15; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> t.test(owoputi,ranlp, alt='greater')
Welch Two Sample t-test
data: owoputi and ranlp
t = 1.825, df = 2922.547, p-value = 0.03405
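The same test can be cross-checked without R. A minimal stdlib-only Python sketch, building the 0/1 vectors with the same rounding as the R session and using a normal approximation for the one-sided p-value (reasonable at df ~ 2900); the exact t value may differ slightly from the R output depending on how the counts round:

```python
import math

def counts(p, n):
    # replicate rep(1, round(p*n)), rep(0, round((1-p)*n)) from the R session
    ones = round(p * n)
    total = ones + round((1 - p) * n)
    return ones, total

def welch_t(p1, n1, p2, n2):
    ones1, m1 = counts(p1, n1)
    ones2, m2 = counts(p2, n2)
    mean1, mean2 = ones1 / m1, ones2 / m2
    # sample variance of a 0/1 vector (with the n/(n-1) correction)
    var1 = mean1 * (1 - mean1) * m1 / (m1 - 1)
    var2 = mean2 * (1 - mean2) * m2 / (m2 - 1)
    return (mean1 - mean2) / math.sqrt(var1 / m1 + var2 / m2)

t = welch_t(0.9, 15185, 0.8869, 15185 * 0.15)
# one-sided p-value; with df ~ 2900 the t distribution is close to normal
p = 0.5 * math.erfc(t / math.sqrt(2))
print(t, p)  # t around 1.8-1.9, p below 0.04
```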
Note, as far as 95% intervals go, their accuracy is 88.7 +/- 1.3, while ours is
90.0 +/- 0.5. (The intervals overlap, but running the t-test has better power
than a naive overlap test. A simulation gives the same answer.)
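Those interval half-widths follow from the normal approximation to the binomial proportion; a quick sketch (the smaller sample is 15% of 15185 tokens):

```python
import math

def halfwidth(p, n):
    # 95% normal-approximation interval for a binomial proportion
    return 1.96 * math.sqrt(p * (1 - p) / n)

print(halfwidth(0.887, 15185 * 0.15))  # ~0.013, i.e. +/- 1.3%
print(halfwidth(0.900, 15185))         # ~0.005, i.e. +/- 0.5%
```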
The deeper issue is that we need to annotate new data to really address the
problem of PTB-style POS tagging for Twitter. The Ritter dataset is a nice
start but is quite small; since it is so small, we and the original Ritter
paper used cross-validation to evaluate accuracy, but it's starting to get
dev/tuning-tainted for future development. Splitting an evaluation set off
from within the Ritter dataset is also unsatisfactory since it's so small --
Derczynski's 15% split gives confidence intervals of +/- 1.3%, which are awfully
wide! There are also other issues: the dataset was created by a single
annotator with no agreement analysis; it has no tagging guidelines or
information on how social media-specific linguistic issues were handled; and
there are still open questions about the best way to adapt PTB conventions to
Twitter (e.g. the compounds issue).
References
Owoputi et al. NAACL 2013: http://www.ark.cs.cmu.edu/TweetNLP/owoputi+etal.naacl13.pdf
Ritter et al. EMNLP 2011: http://turing.cs.washington.edu/papers/ritter-emnlp2011-twitter_ner.pdf