@briandoll
Created April 30, 2012 17:12

Dataset: Programming Language Correlations

This dataset explores the relationships between programming languages.

Example: How likely is it that a programmer who writes in Objective-C also programs in Java? (31%)

How is the data collected?

GitHub identifies the programming languages used in each repository and determines which one is the primary language. Each active GitHub.com user has a list of programming languages they have used, derived from the language information in their repositories.

How was the data analyzed?

These relationships between programming languages are asymmetrical. To determine the relationship from language A to language B, we count the users who have both A and B and divide by the total number of users who have A. Dividing the same pair count by the total number of users who have B gives the relationship from B to A.

Example data:

  • Nine people have repositories written in Ruby only
  • Two people have repositories written in Ruby and PHP
  • One person has repositories written in PHP only

Example results:

  • The correlation from PHP to Ruby is 66.7% (2/3 of people who use PHP also use Ruby)
  • The correlation from Ruby to PHP is 18.2% (2/11 of people who use Ruby also use PHP)
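
The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code GitHub used: the `users` list and the `language_relationships` function are hypothetical names, and each user is modeled simply as a set of languages.

```python
from collections import Counter
from itertools import permutations

# Hypothetical input: one set of languages per user, mirroring the example data.
users = (
    [{"Ruby"}] * 9            # nine people use Ruby only
    + [{"Ruby", "PHP"}] * 2   # two people use Ruby and PHP
    + [{"PHP"}]               # one person uses PHP only
)

def language_relationships(users):
    """Return {(a, b): percentage of users of a who also use b}."""
    totals = Counter()  # number of users per language
    pairs = Counter()   # number of users per ordered pair (a, b)
    for langs in users:
        totals.update(langs)
        pairs.update(permutations(langs, 2))
    # Divide each pair count by the total for the "from" language.
    return {(a, b): round(100 * n / totals[a], 1) for (a, b), n in pairs.items()}

relationships = language_relationships(users)
for (a, b), pct in sorted(relationships.items()):
    print(f"{a} -> {b}: {pct}%")
```

With this input, 3 users have PHP, 11 have Ruby, and 2 have both, so the two directions share a numerator but use different denominators. That is why the relationship is asymmetrical.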

When was the data published?

The data was gathered on March 2nd, 2012 and was published on April 9th, 2012.

What format is the data in?

The dataset is in JSON format.

The correlation from CoffeeScript to Ruby:

{
  "from": "CoffeeScript",
  "correlation": "87.9",
  "to": "Ruby"
}

The correlation from Ruby to CoffeeScript:

{
  "from": "Ruby",
  "correlation": "17.7",
  "to": "CoffeeScript"
}
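
A minimal sketch of reading these records in Python follows. It assumes the dataset is a JSON array of objects shaped like the two examples above (the top-level layout is not shown here), and the helper names `build_lookup` and `relationship` are hypothetical. Note that `correlation` is stored as a string, so it is converted to a float. Missing pairs are treated as 0% and same-language pairs as 100%, following the discussion in the comments.

```python
import json

# Assumption: the dataset is a JSON array of records like the examples above.
# This inline sample stands in for the real file contents.
raw = """
[
  {"from": "CoffeeScript", "correlation": "87.9", "to": "Ruby"},
  {"from": "Ruby", "correlation": "17.7", "to": "CoffeeScript"}
]
"""

def build_lookup(records):
    """Map (from, to) to a float percentage; values arrive as strings."""
    return {(r["from"], r["to"]): float(r["correlation"]) for r in records}

def relationship(lookup, a, b):
    """Percentage of users of a who also use b."""
    if a == b:
        return 100.0  # diagonal entries are not stored in the data
    return lookup.get((a, b), 0.0)  # a missing pair is treated as 0%

lookup = build_lookup(json.loads(raw))
print(relationship(lookup, "CoffeeScript", "Ruby"))
print(relationship(lookup, "Ruby", "CoffeeScript"))
print(relationship(lookup, "Ruby", "COBOL"))
```

Keying the lookup on the ordered pair keeps the two directions separate, which matters because the values differ depending on which language is the starting point.
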
@mjwillson

Thanks! this should be interesting.

Quick question: when a given language pair is missing, presumably we should assume 0%? (Except for the missing diagonal entries, which must be 100%.)

In case anyone else was briefly confused by the terminology here: it sounds like this is a matrix of (estimates of) conditional probabilities, not technically of correlations, which would be symmetric, among other things.

@briandoll
Author

@mjwillson Yes, if a pair is missing, you can assume 0%.

RE: "matrix of conditional probabilities" - thank you! I know "correlation" wasn't technically correct, after having read a ton about it, but I never found the right terminology. A friend brought up "contingency tables" and "chi-squared tests" as an explanation of what this data represents, as well.

@akshayjshah

It's a bit late, but I used GitHub Archive to compute and plot a correlation matrix for the most popular languages. The code is in a gist, and I also wrote a blog post about it.
