You should be able to paste the code below into RStudio to plot this data and play with it yourself.
The first graph, the scatterplot (thanks to Allen Goodman):
library("ggplot2") # install.packages("ggplot2")
library("RCurl") # install.packages("RCurl")
# Download the raw per-repository observations as CSV text and parse them.
observations <- getURL("https://gist.githubusercontent.com/mrb/ea9a2aa3f41e36f37035/raw/159380f9658e47569fd048a3c6baee858e65ce8d/gistfile1.txt")
observations <- read.csv(text = observations)
# Coerce GPA to numeric (non-numeric entries become NA) and round to two decimals.
observations$GPA <- round(as.numeric(as.character(observations$GPA)), 2)
# Keep only repos with more than 0 and fewer than 10 authors.
observations <- observations[which(observations$AuthorCount > 0 & observations$AuthorCount < 10), ]
# Scatterplot of GPA against author count with a fitted linear trend line.
ggplot(observations, aes(x = AuthorCount, y = GPA)) +
  geom_point(shape = 1) +
  geom_smooth(method = lm, se = FALSE) +
  labs(title = "GPA of Repos by Author Count") +
  xlab("Author Count")
The line graph (thanks to JD Maturen):
library(ggplot2)
library(RCurl)
observations <- getURL("https://gist.githubusercontent.com/mrb/2975281ff4e5306f2955/raw/53e35ef87c6df3e8dd366471a5638b7f7e448f75/binned_data.csv")
observations <- read.csv(text = observations)
# Order the team-size buckets and give them readable range labels.
observations$bucket <- factor(observations$bucket, levels = c("1+", "2+", "3+", "5+", "10+"), labels = c("1", "2", "3-4", "5-9", "10+"))
# On ggplot2 >= 3.4.0, `linewidth = 2` replaces the deprecated `size = 2` for lines.
ggplot(observations, aes(gpa, color = bucket)) +
  geom_density(size = 2) + scale_x_reverse() +
  labs(title = "Density of GPAs per Team Size", x = "GPA", y = "Density") +
  guides(color = guide_legend(title = "Team Size"))
The data is available in raw form at the gist URL used in the scatterplot script above,
and in binned form at the gist URL used in the line graph script.
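If you want to go from the raw author counts to the same team-size labels the line graph uses, something like the following might work. This is only a sketch: `bin_team_size` is a hypothetical helper, and the assumption that the buckets correspond to the ranges 1, 2, 3-4, 5-9 and 10+ comes from the labels in the plotting code, not from how the binned file was actually produced.

```r
# Hypothetical helper (not from the original post): map raw author counts
# onto the team-size labels used in the density plot.
bin_team_size <- function(author_count) {
  cut(
    author_count,
    # Right-closed intervals: (0,1], (1,2], (2,4], (4,9], (9,Inf]
    breaks = c(0, 1, 2, 4, 9, Inf),
    labels = c("1", "2", "3-4", "5-9", "10+"),
    right = TRUE
  )
}

# Toy input for illustration, not taken from the real data set.
bin_team_size(c(1, 2, 3, 4, 5, 9, 10, 25))
```

Counts of 0 or less fall outside every interval and come back as `NA`, which is in line with the scatterplot script dropping repos whose author count is not above 0.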
Enjoy! Please let us know if you do anything cool with it!
Here's some more analysis I did.
Mainly I was interested in the GPA distribution (why is there such a high proportion of GPAs of 4.0?)
and in how team size affects GPA. It turns out that team size strongly affects whether your GPA is 4.0, but if your GPA isn't 4.0, team size doesn't matter so much.
http://nbviewer.ipython.org/urls/gist.githubusercontent.com/jvns/f33f96a7a3a6f833a36c/raw/1f4dd8d72d0b9878495d69ad025741168f3b2ab1/gpa_vs_team_size.ipynb
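One rough way to check a claim like this in R is to look at two numbers per team size: the share of repos sitting at exactly 4.0, and the mean GPA of the repos below 4.0. The sketch below is not the analysis from the linked notebook. `summarize_gpa_by_team` is a hypothetical helper, it assumes a data frame with the `AuthorCount` and `GPA` columns from the raw data, and the toy values are invented purely to show the shape of the output.

```r
# Hypothetical helper: for each team size, compute the share of repos at
# exactly 4.0 and the mean GPA of the repos below 4.0.
summarize_gpa_by_team <- function(observations) {
  groups <- split(observations$GPA, observations$AuthorCount)
  data.frame(
    AuthorCount = as.numeric(names(groups)),
    share_at_4 = unname(vapply(groups, function(g) mean(g == 4), numeric(1))),
    # NaN when every repo in the group is at 4.0
    mean_below_4 = unname(vapply(groups, function(g) mean(g[g < 4]), numeric(1)))
  )
}

# Toy data, invented for illustration only (not real observations).
toy <- data.frame(
  AuthorCount = c(1, 1, 1, 2, 2, 2),
  GPA = c(4.0, 4.0, 3.0, 4.0, 3.5, 2.5)
)
summarize_gpa_by_team(toy)
```

If `share_at_4` moves a lot across team sizes while `mean_below_4` stays roughly flat, that would be consistent with the pattern described above. Rows with a missing GPA would need to be dropped first.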