@kylebgorman
Created July 10, 2011 17:38
The Z_r averaging transform in R; very useful for studying the statistical properties of sparse data
# Z_r (or "averaging") transform functions, based on:
#
# Kenneth W. Church and William A. Gale. 1991. A comparison of the enhanced
# Good-Turing and deleted estimation methods for estimating probabilities of
# English bigrams. Computer Speech and Language 5(1):19--54
#
# Kyle Gorman <kgorman@ling.upenn.edu>
#
# Church and Gale do not say what is to be done about points at the edges. I
# have chosen to average them with respect to only the inward facing frequency,
# which seems consistent to me with what Church and Gale had in mind. Comments
# are welcome about this choice, of course.
#
# I am making this code available because several people have told me that it's
# not obvious.
#
# There are two versions: Zr.nr takes paired r and n_r vectors, and Zr.f takes
# a single vector f of integer frequencies (if f already holds probabilities,
# multiply them back out to counts first).
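Zr.nr <- function(r, nr) {
    # compute a smoothed frequency distribution using the Z_r statistic;
    # r holds the observed frequencies in increasing order, nr their counts
    stopifnot(length(r) == length(nr), length(r) >= 2)
    zro <- nr
    L <- length(nr)
    # lower edge: average over the inward-facing gap only
    zro[1] <- zro[1] / (r[2] - r[1])
    # interior points: average over half the span between the two neighbours
    i <- 2
    while (i < L) {
        zro[i] <- 2 * zro[i] / (r[i + 1] - r[i - 1])
        i <- i + 1
    }
    # upper edge: average over the inward-facing gap only
    zro[L] <- zro[L] / (r[L] - r[L - 1])
    return(zro)
}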
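Zr.f <- function(f) {
    # f is a vector of integer frequencies; returns a data frame for plotting
    q <- rle(sort(f))  # run-length encode the sorted frequencies
    r <- q$values      # distinct frequencies, in increasing order
    nr <- q$lengths    # number of items observed with each frequency
    Zr <- Zr.nr(r, nr)
    return(data.frame(r, nr, Zr))
}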
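
The transform can also be written without an explicit loop. The sketch below is not part of the original gist: `zr_sketch` is a hypothetical name, and the toy vector `f` is made up for illustration. It assumes r is strictly increasing with at least two entries, and it applies the same inward-facing treatment of the edge points described in the header comment.

```r
# Hypothetical vectorized sketch of the Z_r transform (not from the original gist).
# Assumes r is sorted, strictly increasing, and has at least two entries.
zr_sketch <- function(r, nr) {
    stopifnot(length(r) == length(nr), length(r) >= 2)
    L <- length(r)
    # gaps between successive observed frequencies (length L - 1)
    gaps <- diff(r)
    # interior points: half the span between the two neighbours;
    # edge points: the single inward-facing gap
    width <- c(gaps[1], (gaps[-1] + gaps[-(L - 1)]) / 2, gaps[L - 1])
    nr / width
}

# toy frequency vector (made up): r takes the values 1, 2, 3, 5, 10
f <- c(1, 1, 1, 1, 2, 2, 3, 5, 10)
q <- rle(sort(f))
toy <- data.frame(r = q$values, nr = q$lengths)
toy$Zr <- zr_sketch(toy$r, toy$nr)
print(toy)
```

For this toy input the interior value at r = 3 is 2 * 1 / (5 - 2), while the edge values at r = 1 and r = 10 are divided by the single adjacent gap (1 and 5 respectively).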