@idan
Created September 24, 2012 22:13

Visualizing Github

A treasure trove of data is captured daily by Github; it has become our shared consciousness of thoughts made code. What stories can that data tell us about how we think and work? How would one go about finding and telling those stories? This talk is a soup-to-nuts tour of practical data visualization with Python and web technologies, covering both the extraction and display of data in illumination of a familiar dataset.

Detailed Description

In the time that we have been crafting software, our collective efforts have never been cataloged neatly in one centralized location. Some projects have long developed in the open, and some have even exposed their development history in some form or another—but the connections between multiple projects remained hidden.

These connections between multiple developers and multiple projects are the glue that binds us together into larger developer communities—they are our mirror, and for the first time we can take a look at ourselves with the aid of the Github API, our favorite dynamic programming language, and standards-based web technologies.

Github provides the perfect case study in the practice of extracting and presenting meaning from data. Come watch us tell a story about telling new stories with a familiar dataset: the tools, the techniques, and the thinking behind our anthropological journey into the largest coding metacommunity.

Outline

Part I: Data to Information

Introduction

  • The art of storytelling when you don’t know the story ahead of time.
  • The seven stages of Data Visualization as per Ben Fry: Acquire, Parse, Filter, Mine, Represent, Refine, Interact.

Acquiring, Parsing, Filtering

  • Data rarely comes neatly packaged.
  • The practicalities of getting data out of APIs. A brief tour of the data acquisition toolbox in Python: requests, Celery, Beautiful Soup, pyparsing, IPython Notebook, and possibly pandas. Being a polite data-slurping netizen: optimizing data access by queries and dealing with rate limits.
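As a rough illustration of dealing with rate limits, the sketch below decides how long a client should pause before its next request. It assumes the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers that GitHub's REST API reports; the function name `seconds_to_wait` and the `now` parameter are hypothetical and exist only for this example.

```python
import time


def seconds_to_wait(headers, now=None):
    """Return how many seconds to pause before the next API request.

    `headers` is a mapping of response headers. The two header names
    below are the ones GitHub's REST API uses for its rate limit; other
    APIs will differ, so treat them as an assumption.
    """
    now = time.time() if now is None else now
    # If a header is missing, assume we still have quota left.
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    # The reset time is given as UTC epoch seconds.
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    if remaining > 0:
        return 0.0
    # Quota exhausted: wait until the window resets (never a negative wait).
    return max(0.0, float(reset_at - now))
```

With requests, `response.headers` can be passed in directly, and the result handed to `time.sleep()` before the next call. A plain `dict` is case-sensitive, so its keys need to match the header names exactly.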

Mining

  • The hard part: teasing out a story. Who is your audience? What would entertain and enlighten them? The essence of journalism.
  • Storing the data for display: how will data be queried? Does it even make sense to store it all in one kind of database? At scale, your data begins to look a lot like your presentation.

Part II: Information to Meaning

This talk picks up where part I left off, continuing the journey from filtered, structured information to a coherent visualization. The outline below lists the general points of theory, which are illustrated liberally with examples from our Github visualization project and other relevant experiences.

Finding a good representation for your data.

  • One story or many stories? Static vs. Interactive.
  • Avoiding simple charts and embracing a richer visual language: strategies and examples.
  • Guided vs. free exploration narratives.

The medium

  • Data visualization and the web: a brief history of tools. Static images, Processing/Nodebox, Processing.js, Flash, D3.js.
  • Constraints: the medium’s sweet spot, and audience considerations.
  • The temptation of display-on-hover and good workarounds for conditionally displaying more data on touch platforms.

The backend: serving up your data.

  • Storing your data for access: caching and denormalization.
  • Composing your API resources with your clients in mind. Make lazy loading possible, and do everything you can to keep the initial view data compact and cached.
  • Precomputing: because client cycles are cheap but not free.
  • Django + Tastypie
  • JSON is not your only friend: UTFgrid and other creative ways to represent data.
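The denormalization and precomputing points above can be sketched in a few lines. This is a minimal illustration rather than the project's actual backend: the record fields (`repo`, `author`) and the function names `build_summaries` and `to_compact_json` are hypothetical.

```python
import json
from collections import defaultdict


def build_summaries(commits):
    """Denormalize flat commit records into one precomputed summary per repo.

    Each record is assumed to be a dict with "repo" and "author" keys
    (hypothetical field names chosen for this example).
    """
    acc = defaultdict(lambda: {"commits": 0, "authors": set()})
    for record in commits:
        entry = acc[record["repo"]]
        entry["commits"] += 1
        entry["authors"].add(record["author"])
    # Convert sets to sorted lists so the result is JSON-serializable
    # and stable between runs (which helps caching).
    return {
        repo: {"commits": e["commits"], "authors": sorted(e["authors"])}
        for repo, e in acc.items()
    }


def to_compact_json(payload):
    """Serialize without extra whitespace to keep the initial payload small."""
    return json.dumps(payload, separators=(",", ":"), sort_keys=True)
```

The summaries can be computed once on the server and cached, so the client's first view needs only the small per-repo document and can lazily request the detailed records later.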

The frontend: displaying your data

  • Marrying data to the DOM: wrapping your brain around D3.js.
  • The difference between visualization toolkits and charting libraries.
  • Multiresolution data: timeseries range-rollups, k-means clustering for maps and other planar data.
  • Tasteful uses of animation
  • Responsive data visualizations: a tough problem. Best practices. Responding to things other than screen size: time, location.
  • The geo stack: a brief tour of mapping technologies and the web. (Optional, if time permits.)
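One way to picture the timeseries range-rollups mentioned above is to count events in fixed-width time buckets and precompute the same series at several resolutions. The sketch below assumes events arrive as epoch-second timestamps; the function name `rollup` is hypothetical.

```python
from collections import Counter


def rollup(timestamps, bucket_seconds):
    """Count events per fixed-width time bucket.

    `timestamps` are epoch seconds; the result maps each bucket's start
    time (epoch seconds) to the number of events that fell inside it.
    """
    counts = Counter()
    for ts in timestamps:
        # Snap the timestamp down to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        counts[bucket_start] += 1
    return dict(sorted(counts.items()))


# The same events at two resolutions: hourly for zoomed-in views,
# daily for the overview.
events = [0, 10, 3599, 3600, 7300]
hourly = rollup(events, 3600)
daily = rollup(events, 86400)
```

Serving the coarse series first and fetching a finer one only when the viewer zooms in keeps the data sent to the client proportional to what is actually on screen.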

Other information:

I'm Django's "Benevolent Designer for Life"; as a member of the core team I'm responsible for issues which touch on the needs of frontend developers as well as anything which can be improved through design.

As a designer/developer hybrid, I've spoken at three DjangoCons, including my keynote address at DjangoCon US 2011. I gave two very well-received talks at PyCon last year, one of which was used as the basis for a new curriculum in high schools in Alaska. I've also spoken numerous times at local Python and web development meetups. I think I’m within bounds to say that I can deliver a fun and engaging experience to the PyCon audience.

My recorded talks and related materials are available on Lanyrd.
