@jwlrs
Created May 11, 2012 00:56
Would a GitHub-backed site to determine common coding practices be useful?

A GitHub coding pattern analysis site

What if we could use GitHub to determine how common certain coding practices are among savvy developers? Imagine a site where you could search for particular coding patterns, or possibly just view total counts of particular patterns already deemed important. This could be very useful for language and library designers, students new to a language, and more.

As background, over at https://github.com/JSFixed/JSFixed we are discussing what features we want to see in the next generation of ECMAScript (though discussion has focused on JavaScript specifically for the most part). In https://github.com/JSFixed/JSFixed/issues/49, https://github.com/polotek asked "Is there any quantitative data to replace the subjective argument that 'lots of people need this all the time'?" One answer is that there could be plenty of quantitative data available if we mined GitHub itself.

This is a bit like what Google Code Search was, but geared toward answering aggregate questions about large bodies of code rather than finding specific code. It would also be a lot more sophisticated than GitHub's own search.

An Example Use for JavaScript

You would arrive at this site and choose JavaScript. From there you would see a list of questions about usage, with counts across projects. For example, here are some simple cases that need little more than plain searches, file counts and mechanical runs of standard tools:

  • How many repos include multiple non-library, non-test scripts?
  • How many repos include/assume jQuery?
  • How many scripts pass JSLint (possibly measured with several levels of tolerance)?
  • What proportion of repos include eval() in non-library code?
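
As a rough illustration of the last question, here is a minimal Python sketch, not a definitive implementation. It assumes each repo has already been loaded as a mapping from file path to source text. The `LIBRARY_HINTS` list and all function names are hypothetical choices made for this example, and the regex does not skip comments or string literals.

```python
import re

# Hypothetical heuristic: path fragments that suggest vendored or library code.
LIBRARY_HINTS = ("node_modules/", "vendor/", "lib/", ".min.js")

# Matches a call such as eval(...) but not identifiers like retrieval(...).
EVAL_CALL = re.compile(r"(?<![\w$])eval\s*\(")


def is_library_path(path):
    """Guess whether a path points at library code (assumption, not a rule)."""
    return any(hint in path for hint in LIBRARY_HINTS)


def repo_uses_eval(files):
    """files maps a path to its source text for a single repo."""
    return any(
        EVAL_CALL.search(source)
        for path, source in files.items()
        if path.endswith(".js") and not is_library_path(path)
    )


def eval_proportion(repos):
    """repos maps a repo name to its files; returns the share that use eval()."""
    if not repos:
        return 0.0
    hits = sum(1 for files in repos.values() if repo_uses_eval(files))
    return hits / len(repos)
```

A real version would probably want a parser rather than a regex, so that `eval` inside comments or strings is not counted.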

More complex examples requiring smart regexes and small scripts to determine:

  • What is the distribution of variable name lengths, function name lengths, etc.?
  • What is the distribution of different combinations of the primary jQuery.ajax function parameters?
  • How often are patterns like ( x && x.y && x.y.z ) used?
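
Two of these questions can be approximated with regexes alone. The Python sketch below is only an approximation under stated assumptions: `IDENT` covers ASCII identifiers only, the function names are hypothetical, and comments and string literals are not excluded. It counts `&&` expressions that contain at least one guarded step such as `x && x.y`, and builds a histogram of named-function name lengths.

```python
import re
from collections import Counter

IDENT = r"[A-Za-z_$][\w$]*"
OPERAND = rf"{IDENT}(?:\.{IDENT})*"
# A run of dotted names joined by &&, e.g. "x && x.y && x.y.z".
AND_EXPR = re.compile(rf"(?<![\w$.]){OPERAND}(?:\s*&&\s*{OPERAND})+")


def is_guard_step(prev, nxt):
    """True when nxt is prev plus exactly one more property, e.g. x -> x.y."""
    return nxt.startswith(prev + ".") and "." not in nxt[len(prev) + 1:]


def count_guard_chains(source):
    """Count && expressions containing at least one guarded step."""
    count = 0
    for expr in AND_EXPR.findall(source):
        parts = [part.strip() for part in expr.split("&&")]
        if any(is_guard_step(a, b) for a, b in zip(parts, parts[1:])):
            count += 1
    return count


def function_name_lengths(source):
    """Histogram of name lengths for named function declarations/expressions."""
    names = re.findall(rf"\bfunction\s+({IDENT})", source)
    return Counter(len(name) for name in names)
```

Anything beyond this level of detail, such as guards written with ternaries or helper functions, would likely need a real parser.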

And others requiring some underlying linkage to the commit history of a repo:

  • Can we identify types of code patterns that seem to cause lots of volatility and bugs?
  • Do functions leveraging particular library X get changed more frequently than library Y?
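
One possible starting point for the history-based questions is per-file churn. The Python sketch below assumes the text produced by `git log --numstat --format=` has already been captured for a repo, and sums added plus deleted lines per path. The function name is hypothetical, and this measures files rather than individual functions, so it is only a coarse proxy for the questions above.

```python
from collections import Counter


def churn_by_file(numstat_text):
    """Sum added + deleted lines per path from `git log --numstat` output."""
    churn = Counter()
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separators or anything that is not a numstat row
        added, deleted, path = parts
        if not (added.isdigit() and deleted.isdigit()):
            continue  # binary files are reported with "-" instead of numbers
        churn[path] += int(added) + int(deleted)
    return churn
```

Joining these totals with a per-file pattern detector (for instance, which library each file uses) would give a first comparison of volatility between groups of files.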

Which kind of count to show (number of projects? ratios? percentages?) is left open here, and different types of questions probably need different types of counting. Results might also need to be weighted by repo popularity, etc.
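
To make the difference between these kinds of counts concrete, here is a small Python sketch. It assumes each repo has already been reduced to a yes/no answer for one question, and it uses a star count as a stand-in for popularity. Both the `stars` field and the `summarize` function are hypothetical names for this example.

```python
def summarize(repos):
    """repos: list of dicts with 'matches' (bool) and 'stars' (int)."""
    total = len(repos)
    matching = [repo for repo in repos if repo["matches"]]
    total_stars = sum(repo["stars"] for repo in repos)
    matching_stars = sum(repo["stars"] for repo in matching)
    return {
        # Raw number of repos where the pattern was found.
        "count": len(matching),
        # Share of all repos, each repo counting equally.
        "ratio": len(matching) / total if total else 0.0,
        # Share of all repos, each repo weighted by its popularity.
        "weighted_ratio": matching_stars / total_stars if total_stars else 0.0,
    }
```

With one popular matching repo and two obscure non-matching ones, the plain ratio is one third while the weighted ratio can be close to one, which is why the choice of count changes the answer a reader takes away.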

Highest-Level Implementation Sketch

Here is an idea of how such a site could work behind the scenes:

  • At a given interval, this site would pull/clone all repos that have been updated/added since the last interval (at least all repos containing the languages the site deals with).
  • A set of sanctioned question-scripts could be run, one for each "question", updating counts.
  • The source of these question-scripts would be a GitHub repo managed by some responsible parties.
  • Some means of doing on-the-fly queries could be included if safe and performant.
  • Any user could propose a question for a particular language and someone else might write a script to answer it.
  • New languages would be added over time.
  • Other sources of code could be added, such as SourceForge, Google Code or Stack Overflow.
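
The "updating counts" step above could be sketched as follows in Python. This is only one possible design under stated assumptions: each question-script is a function from a repo's files to a yes/no answer, and the latest answer is stored per repo so that re-running an updated repo replaces its earlier contribution instead of double counting. `QuestionTally` and `includes_jquery` are hypothetical names.

```python
class QuestionTally:
    """Keeps the latest answer per repo for one question-script."""

    def __init__(self, question):
        self.question = question  # callable: files dict -> truthy/falsy
        self.answers = {}         # repo name -> latest boolean answer

    def update(self, repo_name, files):
        """Re-run the question for a repo that was added or updated."""
        self.answers[repo_name] = bool(self.question(files))

    def count(self):
        """Number of repos whose latest answer is yes."""
        return sum(self.answers.values())


def includes_jquery(files):
    """Hypothetical question-script: does any file name start with 'jquery'?"""
    return any(
        path.rsplit("/", 1)[-1].lower().startswith("jquery") for path in files
    )
```

On each interval, the site would call `update` only for repos that changed since the last run, so the cost of a refresh scales with the amount of new activity rather than with the total number of repos.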

Questions

  1. Most of all, does something like this already exist?
  2. Would any of you want to use such a site?
  3. Would any of you contribute to such a project?
  4. Are there query languages / tools that would work really well for this kind of project?
  5. Are there enough interesting questions to make this an ongoing site or would a one-time analysis project answer most important questions?
@polotek

polotek commented May 11, 2012

This is a great outline for a project. For the purposes of discussing ES.next, I don't think it even needs to be super robust, as long as there is some rigor in the routines that identify code patterns. As for getting something like this going, you could start with setting up the basic scaffolding.

  • put in a git repo and it'll get cloned
  • what does a query routine look like?
    • identify relevant files (input)
    • collect patterns (processing)
    • return pattern stats (output)
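
One possible shape for such a query routine, sketched in Python, follows the three stages listed above. It assumes a repo has already been cloned to a local directory and that a pattern is expressed as a regex. All function names and the stats fields are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path


def identify_files(root, suffix=".js"):
    """Input: every file under root with the given suffix, in sorted order."""
    return sorted(p for p in Path(root).rglob(f"*{suffix}") if p.is_file())


def collect_patterns(paths, pattern):
    """Processing: count regex matches in each file."""
    regex = re.compile(pattern)
    counts = Counter()
    for path in paths:
        text = path.read_text(encoding="utf-8", errors="replace")
        counts[str(path)] = len(regex.findall(text))
    return counts


def pattern_stats(counts):
    """Output: summary numbers for one repo."""
    return {
        "files_scanned": len(counts),
        "files_with_match": sum(1 for n in counts.values() if n > 0),
        "total_matches": sum(counts.values()),
    }


def run_query(root, pattern):
    """Run all three stages for one cloned repo."""
    return pattern_stats(collect_patterns(identify_files(root), pattern))
```

For example, `run_query("path/to/clone", r"\bthis\b")` would report how many `.js` files mention `this` and how often. Keeping the three stages as separate functions means the processing step could later be swapped for a parser-based one without touching the other two.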

If you got something this simple set up, I'd start playing around with it. Can't promise much help up front though.

@ariya

ariya commented May 11, 2012

I believe you can use Esprima to "learn" about the semantics and patterns in the code. Some related examples:

http://ariya.ofilabs.com/2012/04/most-popular-javascript-statements.html

http://ariya.ofilabs.com/2012/03/most-popular-javascript-keywords.html

@jwlrs
Author

jwlrs commented May 23, 2012

Thank you all for the input. This still sounds like an exciting project and very doable. At the moment I have to exert some self-discipline and not take on yet another side project, but I'll keep this in mind. Even after the ES.next ship has sailed, this would be valuable.

@jwlrs
Author

jwlrs commented Jul 9, 2013

This is a bit like what I proposed: http://sideeffect.kr/popularconvention/
