@tdunning
Created June 18, 2021 07:30
Snippet of R to recreate an analysis of t-digest interpolation on real data
# Analysis of how two t-digests see some sample data
png("figure.png", width=1200, height=1000, pointsize=30)
# the first few actual data points with filler for the remainder
d = c(241, 543, 575, 702, 890, 1530, 1940, 2166, 2168, rep(3000,33))
# the cumulative distribution function
f = ecdf(d)
# plot the actual CDF
plot(x=d, y=f(d), xlim = c(700, 2300), ylim = c(0.08, 0.25), type='s',
xlab="Sample value", ylab="Cumulative Distribution Function",
cex.lab=1.3)
# now plot the results from t-digest number 1 where it starts interpolating
lines(c(890, 1735, 2167,2333), c(5,6,8,10)/42, col=rgb(1,0,0,0.4), type='b', lwd=5, pch=NA)
# highlight the centroids
points(c(1735, 2167,2333), c(6,8,10)/42, pch=21, cex=1.5, lwd=5, col=rgb(1,0,0,0.4))
text(1735, 5.6/42, expression(w==2), adj=0.2)
text(2040, 8.2/42, expression(w==2))
text(2250, 10.4/42, expression(w==2))
# and the same for t-digest number 2 for the places it is interpolating
lines(c(1530,2053,2280, 2563), c(6,7,9,12)/42, col=rgb(0,0,1,0.4),lwd=5, type='b',lty=2, pch=NA)
points(c(2053, 2280, 2563), c(7,9,12)/42, pch=21, cex=1.5, lwd=5, col=rgb(0,0,1,0.4))
# mark each exact sample at the midpoint of its CDF step
points(x=d, y=f(d)-0.5/42, pch=21, cex=0.5)
text(2053, 6.6/42, expression(w==2), adj=0.2)
text(2160, 9.2/42, expression(w==3))
text(2563, 10/42, expression(w==3))
# now mark the estimates with error bars
# the first t-digest interpolates
points(923.80, 0.12, pch=13, cex=2, lwd=3)
arrows(923.80-250,0.12,923.80+250,0.12, angle=90, col=rgb(1,0,0,0.3), lwd=6, code=3, length=0.1)
# the second t-digest is not interpolating yet so it gets the exact result
points(1530, 0.12, pch=13, cex=2, lwd=3)
arrows(1530-250,0.12,1530+250, 0.12, angle=90, col=rgb(0,0,1,0.3), lwd=6, code=3, length=0.1)
legend(700, 0.24, legend=c("Estimated values", "Exact sample", "t-digest A", "t-digest B"), pch=c(13, 21, 21, 21), col=c('black', 'black', 'red', 'blue'), pt.cex=c(2,0.5,1.5,1.5), pt.lwd=c(3,2,6,6))
dev.off()
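
The two estimates marked at a quantile of 0.12 can be checked numerically. The following is a minimal sketch and not part of the original script. It assumes that t-digest A estimates the quantile by linear interpolation along the red segments plotted above, and that t-digest B, which is not yet interpolating at this quantile, returns the inverse of the empirical CDF. The names `a_x`, `a_cdf`, `est_a`, and `est_b` are illustrative only.

```r
# Sketch (assumption): reproduce the two estimates marked in the figure at q = 0.12.
d <- c(241, 543, 575, 702, 890, 1530, 1940, 2166, 2168, rep(3000, 33))
n <- length(d)  # 42 samples
q <- 0.12

# t-digest A: the knots of the red interpolation line from the plot
a_x   <- c(890, 1735, 2167, 2333)
a_cdf <- c(5, 6, 8, 10) / n

# Invert the piecewise-linear CDF: look up the sample value at cumulative probability q
est_a <- approx(x = a_cdf, y = a_x, xout = q)$y

# t-digest B: q lies below its first interpolated point (6/42), so take the
# inverse empirical CDF (quantile type 1), i.e. the smallest sample with F(x) >= q
est_b <- quantile(d, probs = q, type = 1, names = FALSE)

print(c(A = est_a, B = est_b))
```

Under these assumptions, `est_a` comes out at about 923.8 and `est_b` at 1530, which matches the positions of the two markers drawn with `pch=13`.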