===
[Update July 25... and after https://gist.github.com/leondz/6082658 ]
OK, never mind the questions about cross-validation versus a smaller eval split and all that.
We trained and evaluated our tagger (current release, version 0.3.2)
on the same splits as the GATE tagger
(from http://gate.ac.uk/wiki/twitter-postagger.html, specifically twitie-tagger.zip),
and it gets 90.4% accuracy (significantly different from the GATE results).
Files and details: http://www.ark.cs.cmu.edu/TweetNLP/gate_ritterptb_comparison.zip
===
[The below was from July 24]
Analyzing the accuracy results in Derczynski et al. RANLP-2013
http://derczynski.com/sheffield/papers/twitter_pos.pdf
with regard to claims made in, e.g.:
https://twitter.com/LeonDerczynski/status/357442842035109889
https://twitter.com/LeonDerczynski/status/359951338714570752
All comparisons are on the Ritter et al. EMNLP-2011 dataset.
It's 15185 tokens total.
They report 88.7% accuracy on 15% of the Ritter data (which should be about
2278 tokens) as held-out evaluation data. (Table 8)
We (Owoputi et al. NAACL-2013) used cross-validation on this dataset (following
the protocol in the original Ritter 2011 paper) and did no development or
tuning at all. (We did all of that on a totally different dataset, because we
don't like the linguistic choices in the Ritter dataset, as described in our
paper.) We got 90.0% accuracy. (Table 4)
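For concreteness, here is a toy R sketch (not our actual evaluation code) of the
pooled cross-validation bookkeeping; the random tags just stand in for a real
tagger and real annotations, so only the accounting is meaningful:
  set.seed(0)
  n = 15185
  gold = sample(c("N","V","D"), n, replace=TRUE)   # fake gold tags, illustration only
  pred = ifelse(runif(n) < .9, gold, "N")          # fake tagger output, right about 90% of the time
  fold = sample(rep(1:10, length.out=n))           # toy 10-fold assignment (real folds split by tweet)
  tapply(pred == gold, fold, mean)                 # per-fold accuracies
  sum(pred == gold) / n                            # pooled accuracy over all 15185 tokens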
These sizes are all small so statistical power isn't good. However, they are
definitely comparable since they're samples from the same dataset, and no one
tuned on these test sets during development. Assuming the splits were randomly
chosen, we can compare the true token accuracy rates with an unpaired t-test.
(I'm treating the cross-validation "test set" as the aggregation of all the
splits, size 15185.) There's a clear difference
(p<0.04):
$ R
> owoputi = {p=.9;    n=15185;     c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}  # 0/1 correctness vector: 90.0% of all 15185 tokens
> ranlp   = {p=.8869; n=15185*.15; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}  # 0/1 vector for the ~2278-token held-out split (88.69%)
> t.test(owoputi,ranlp, alt='greater')
Welch Two Sample t-test
data: owoputi and ranlp
t = 1.825, df = 2922.547, p-value = 0.03405
Note, as far as 95% intervals go, their accuracy is 88.7 +/- 1.3, while ours is
90.0 +/- 0.5. (The intervals overlap, but running the t-test has better power
than a naive overlap test. A simulation gives the same answer too.)
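For concreteness, here is a sketch (not part of the original session) of the
normal-approximation intervals and one way such a simulation check could go; the
ci() helper and the rounded counts are just for illustration:
  ci = function(p, n) 1.96 * sqrt(p * (1 - p) / n)   # 95% half-width of a binomial proportion
  ci(.887, round(15185 * .15))   # ~0.013  ->  88.7 +/- 1.3
  ci(.900, 15185)                # ~0.005  ->  90.0 +/- 0.5
  # Parametric simulation at the observed rates: how often does the smaller
  # eval set's accuracy come out at least as high as the cross-validated one?
  set.seed(1)
  n1 = 15185; n2 = round(15185 * .15)
  diffs = rbinom(10000, n1, .900)/n1 - rbinom(10000, n2, .887)/n2
  mean(diffs <= 0)               # comes out around 0.03, in line with the t-test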
The deeper issue is that we need to annotate new data to really address the
problem of PTB-style POS tagging for Twitter. The Ritter dataset is a nice
start but is quite small; since it is so small, we and the original Ritter
paper did cross-validation to evaluate accuracy, but it's starting to get
dev/tuning-tainted for future development. Splitting an evaluation set off
from within the Ritter dataset is also unsatisfactory since it's so small --
Derczynski's 15% size gives confidence intervals of +/- 1.3%, which are awfully
wide! There are also various other issues: the dataset was created by a
single annotator with no agreement analysis, it comes with no tagging guidelines
or information about how social media-specific linguistic issues were handled,
and there are still open questions about the best way to adapt PTB conventions
to Twitter (e.g. the compounds issue).
References
Owoputi et al. NAACL 2013: http://www.ark.cs.cmu.edu/TweetNLP/owoputi+etal.naacl13.pdf
Ritter et al. EMNLP 2011: http://turing.cs.washington.edu/papers/ritter-emnlp2011-twitter_ner.pdf
Derczynski et al. RANLP 2013: http://derczynski.com/sheffield/papers/twitter_pos.pdf