
Dataset: Programming Language Correlations

This dataset explores the relationships between programming languages.

Example: How likely is it that a programmer who writes in Objective-C also programs in Java? (31%)

How is the data collected?

GitHub identifies the programming languages used in each repository and determines which one is the primary language. Each active GitHub.com user has a list of programming languages they have used, derived from the language information in their repositories.

How was the data analyzed?

These relationships between programming languages are asymmetrical. To determine the relationship from language A to B, we count the number of people who have used both languages and divide by the total number of people who have used A. To get the relationship from B to A, we divide the same pair count by the total number of people who have used B.

Example data:

  • Nine people have repositories written in Ruby only
  • Two people have repositories written in Ruby and PHP
  • One person has repositories written in PHP only

Example results:

  • The correlation from PHP to Ruby is 66.7% (2/3 of people who use PHP also use Ruby)
  • The correlation from Ruby to PHP is 18.2% (2/11 of people who use Ruby also use PHP)
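
The calculation can be sketched in a few lines of Python. This is only an illustration of the counting described above, not the code used to build the dataset; the names `users`, `language_totals`, `pair_counts`, and `relationship` are made up for the example.

```python
from collections import Counter
from itertools import permutations

# Each set lists the languages one person has used (the example data above).
users = [{"Ruby"}] * 9 + [{"Ruby", "PHP"}] * 2 + [{"PHP"}]

language_totals = Counter()  # number of people per language
pair_counts = Counter()      # number of people per ordered (A, B) pair

for languages in users:
    language_totals.update(languages)
    # Every ordered pair of distinct languages this person has used.
    pair_counts.update(permutations(sorted(languages), 2))

def relationship(a, b):
    """Percentage of people using language `a` who also use language `b`."""
    return round(100 * pair_counts[(a, b)] / language_totals[a], 1)

print(relationship("PHP", "Ruby"))  # 2 of the 3 PHP users also use Ruby
print(relationship("Ruby", "PHP"))  # 2 of the 11 Ruby users also use PHP
```

The pair count is the same in both directions (2); only the denominator changes, which is why the two results differ.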

When was the data published?

The data was gathered on March 2nd, 2012 and was published on April 9th, 2012.

What format is the data in?

The dataset is in JSON format.

The correlation from CoffeeScript to Ruby:

{
  "from": "CoffeeScript",
  "correlation": "87.9",
  "to": "Ruby"
}

The correlation from Ruby to CoffeeScript:

{
  "from": "Ruby",
  "correlation": "17.7",
  "to": "CoffeeScript"
}
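
A minimal sketch of loading records like these in Python. It assumes the records are wrapped in a single JSON array, which the description above does not state; `raw` and `table` are names invented for the example. Note that "correlation" is stored as a string, so it has to be converted before doing any arithmetic.

```python
import json

# Assumption: the records sit in one JSON array. These two are the
# examples shown above.
raw = """
[
  {"from": "CoffeeScript", "correlation": "87.9", "to": "Ruby"},
  {"from": "Ruby", "correlation": "17.7", "to": "CoffeeScript"}
]
"""

records = json.loads(raw)

# Index by ordered (from, to) pair and convert the percentage to a float.
table = {(r["from"], r["to"]): float(r["correlation"]) for r in records}

print(table[("CoffeeScript", "Ruby")])
print(table[("Ruby", "CoffeeScript")])
```

Keying on the ordered pair keeps the two directions separate, which matters because the relationships are asymmetrical.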

Thanks! This should be interesting.

Quick question: when a given language pair is missing, presumably we should assume 0%? (Except for the missing diagonal entries, which must be 100%.)

In case anyone else was briefly confused by the terminology here: it sounds like this is a matrix of (estimates of) conditional probabilities, not technically of correlations which would be symmetric amongst other things.

@mjwillson Yes, if a pair is missing, you can assume 0%.
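
For anyone loading the data, that rule could be sketched roughly like this in Python. The `lookup` helper and the `table` layout (a dict keyed by ordered language pairs) are hypothetical, and the 100% diagonal is the assumption from the question above rather than something stored in the dataset.

```python
# Hypothetical table keyed by (from, to); values are percentages.
table = {
    ("CoffeeScript", "Ruby"): 87.9,
    ("Ruby", "CoffeeScript"): 17.7,
}

def lookup(table, a, b):
    """Return the percentage from `a` to `b`, filling in missing entries."""
    if a == b:
        return 100.0  # assumed: a language always co-occurs with itself
    return table.get((a, b), 0.0)  # a missing pair is treated as 0%

print(lookup(table, "CoffeeScript", "Ruby"))
print(lookup(table, "Ruby", "Ruby"))
print(lookup(table, "Ruby", "PHP"))
```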

RE: "matrix of conditional probabilities" - thank you! I know "correlation" wasn't technically correct, after having read a ton about it, but I never found the right terminology. A friend brought up "contingency tables" and "chi-squared tests" as an explanation of what this data represents, as well.

It's a bit late, but I used GitHub Archive to compute and plot a correlation matrix for the most popular languages. The code is in a gist, and I also wrote a blog post about it.
