This dataset explores the relationships between programming languages.
Example: How likely is it that a programmer who writes in Objective-C also programs in Java? (31%)
GitHub identifies the programming languages used in each repository and determines which one is the primary language. Each active GitHub.com user has a list of programming languages they have used, derived from the language information in their repositories.
These relationships between programming languages are asymmetrical. To determine the relationship from language A to language B, we count the number of users seen with both A and B and divide by the total number of users of A. To get the relationship from B to A, we divide the same pair count by the total number of users of B.
Example data:
- Nine people have repositories written in Ruby only
- Two people have repositories written in Ruby and PHP
- One person has repositories written in PHP only
Example results:
- The correlation from PHP to Ruby is 66.7% (2/3 of people who use PHP also use Ruby)
- The correlation from Ruby to PHP is 18.2% (2/11 of people who use Ruby also use PHP)
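The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the dataset: the function name `language_relationships` and the input shape (one set of languages per user) are hypothetical. With the example counts listed above, Ruby has 11 users in total (9 + 2) and PHP has 3 (2 + 1).

```python
from collections import Counter
from itertools import permutations


def language_relationships(user_languages):
    """Return {(a, b): percentage of users of a who also use b}.

    user_languages is an iterable with one collection of language
    names per user (a hypothetical input shape for this sketch).
    """
    totals = Counter()  # number of users per language
    pairs = Counter()   # number of users per ordered (a, b) pair
    for languages in user_languages:
        langs = set(languages)
        totals.update(langs)
        # Count each ordered pair once per user: (a, b) and (b, a).
        pairs.update(permutations(sorted(langs), 2))
    # Divide the pair count by the total for the "from" language.
    return {(a, b): 100.0 * n / totals[a] for (a, b), n in pairs.items()}


# The example data: 9 Ruby only, 2 Ruby and PHP, 1 PHP only.
users = [{"Ruby"}] * 9 + [{"Ruby", "PHP"}] * 2 + [{"PHP"}]
rel = language_relationships(users)
print(round(rel[("PHP", "Ruby")], 1))  # 66.7  (2 of 3 PHP users)
print(round(rel[("Ruby", "PHP")], 1))  # 18.2  (2 of 11 Ruby users)
```

The pair count (2) is the same in both directions; only the denominator changes, which is why the two values differ.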
The data was gathered on March 2nd, 2012 and was published on April 9th, 2012.
The dataset is in JSON format.
The correlation from CoffeeScript to Ruby:
{
"from": "CoffeeScript",
"correlation": "87.9",
"to": "Ruby"
}
The correlation from Ruby to CoffeeScript:
{
"from": "Ruby",
"correlation": "17.7",
"to": "CoffeeScript"
}
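As a rough sketch of how records like these might be read, the Python below parses the two examples and looks up a pair. It assumes the records arrive as a JSON array, which is not confirmed by the description, and the helper names `load_relationships` and `lookup` are hypothetical. Note that `correlation` is stored as a string, so it is converted to a float. How absent pairs should be interpreted is not stated here, so the lookup returns `None` rather than guessing a value.

```python
import json

# The two example records, wrapped in an array (an assumption about
# the file layout; the actual dataset may be structured differently).
raw = """[
  {"from": "CoffeeScript", "correlation": "87.9", "to": "Ruby"},
  {"from": "Ruby", "correlation": "17.7", "to": "CoffeeScript"}
]"""


def load_relationships(text):
    """Map (from, to) to the percentage as a float."""
    table = {}
    for record in json.loads(text):
        key = (record["from"], record["to"])
        table[key] = float(record["correlation"])  # stored as a string
    return table


def lookup(table, source, target, default=None):
    """Return the percentage from source to target, or default if absent."""
    return table.get((source, target), default)


table = load_relationships(raw)
print(lookup(table, "CoffeeScript", "Ruby"))  # 87.9
print(lookup(table, "Ruby", "CoffeeScript"))  # 17.7
print(lookup(table, "Ruby", "PHP"))           # None (pair not present)
```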
Thanks! This should be interesting.
Quick question: when a given language pair is missing, presumably we should assume 0%? (Except for the missing diagonal entries, which must be 100%.)
In case anyone else was briefly confused by the terminology here: it sounds like this is a matrix of (estimates of) conditional probabilities, not technically of correlations, which would be symmetric, among other things.