@danabauer
Forked from idan/gist:3778760
Created September 26, 2012 02:31

Visualizing Github

A treasure trove of data is captured daily by Github; it has become our shared consciousness of thoughts made code. What stories can that data tell us about how we think, work, and interact? How would one go about finding and telling those stories? This talk is a soup-to-nuts tour of practical data visualization with Python and web technologies, covering both the extraction and display of data in illumination of a familiar dataset.

Detailed Description

In the time that we have been crafting software, our collective efforts have never been cataloged neatly in one centralized location. Some projects have long developed in the open, and some have even exposed their development history in some form or another—but the connections between multiple projects remained hidden.

These connections between multiple developers and multiple projects are the glue that binds us together into larger developer communities—they are our mirror, and for the first time we can take a look at ourselves with the aid of the Github API, our favorite dynamic programming language, and standards-based web technologies.

Github provides the perfect case study in the practice of extracting and presenting meaning from data. Come watch us tell a story about telling new stories with a familiar dataset: the tools, the techniques, and the thinking behind our anthropological journey into the largest coding metacommunity.

Outline

Part I: Data to Information

Introduction

This talk is about the process of turning raw data into information that I can use to tell a story. I'll introduce Ben Fry's Seven Stages of Visualizing Data to give some structure to my discussion of this process.

  1. First step: acquire data. Sources? Strategies for gathering or extracting data. Thinking about additional data sets to add context to the primary data set.
  2. Second step: Parse. Turn messy raw data into clean, structured data. Or pull data from one structure into another structure. The goal: a data structure that we can work with and begin to explore.
  3. Third step: Filter. At this point we can start to ask some questions of our data. These initial questions might form the basis for our story. Interview our data. Query our data.
  4. Fourth step: Mine. This is the fun part. The exploratory part of our data analysis. What patterns do we expect to see?
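
As a rough illustration of the parse step, here is a minimal sketch that flattens nested GitHub-style event JSON into simple rows. The payload, field names, and the `parse_events` helper are all invented for illustration, not the real GitHub API schema:

```python
import json

# Hypothetical raw GitHub-style event payload -- messy, nested JSON.
raw = '''[
  {"type": "PushEvent", "repo": {"name": "idan/gist"},
   "created_at": "2012-09-25T12:00:00Z"},
  {"type": "ForkEvent", "repo": {"name": "danabauer/viz"},
   "created_at": "2012-09-25T13:30:00Z"}
]'''

def parse_events(payload):
    """Flatten nested event JSON into simple (type, repo, timestamp) rows."""
    return [
        (e["type"], e["repo"]["name"], e["created_at"])
        for e in json.loads(payload)
    ]

rows = parse_events(raw)
# rows[0] -> ("PushEvent", "idan/gist", "2012-09-25T12:00:00Z")
```

The goal is exactly what the list above describes: a flat structure you can filter and query, rather than nested JSON you have to dig through.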

Acquiring, Parsing, Filtering

  • This is the hard, sometimes unpleasant part of data visualization.
  • Data rarely comes neatly packaged.
  • The practicalities of getting data out of APIs, with a brief tour of the data acquisition toolbox in Python: requests, Celery, Beautiful Soup, pyparsing.
  • Tools for cleaning data (not all Python-based)
  • Being a polite data-slurping netizen: optimizing data access by batching queries and respecting rate limits. (Not sure about this one.)
  • Throughout this process, I'm usually asking questions of the data and hypothesizing about patterns I might find.
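
One concrete piece of that politeness can be sketched without any HTTP library at all: a small helper that reads GitHub-style `X-RateLimit-*` response headers and decides how long to sleep before the next request. The function name and header values below are illustrative assumptions, not part of any real client library:

```python
def seconds_to_wait(headers, now):
    """Given rate-limit response headers, return how long to sleep
    before the next request (0 if quota remains).
    `now` is the current Unix timestamp."""
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", now))
    if remaining > 0:
        return 0          # still have quota: go ahead immediately
    return max(0, reset_at - now)  # otherwise wait until the window resets

# Quota exhausted, window resets 100 seconds from "now":
delay = seconds_to_wait(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1100"}, 1000)
# delay -> 100
```

In a real fetch loop you would call this between requests (made with something like requests), sleeping for the returned number of seconds.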

Mining

  • The fun part: teasing out a story. Who is your audience? What would entertain and enlighten them? This is the essence of journalism: interview your data, ask questions. It's an iterative process, especially if you're working with clients: as they start to see initial patterns, they have lots of new questions about the data.
  • The IPython Notebook, pandas, and other tools (not all Python-based) for exploratory data analysis. The notebook is great for sharing preliminary analysis and preliminary visualizations of data with partners, clients, and editors.
  • Storing the data for display: how will data be queried? Does it even make sense to store it all in one kind of database? At scale, your data begins to look a lot like your presentation.
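
A minimal, stdlib-only sketch of "interviewing your data": given flattened event rows (invented here), ask which repositories see the most pushes. In practice this kind of exploration would live in an IPython notebook, probably backed by pandas:

```python
from collections import Counter

# Hypothetical flattened event rows: (event_type, repo_name).
events = [
    ("PushEvent", "idan/gist"),
    ("PushEvent", "danabauer/viz"),
    ("PushEvent", "idan/gist"),
    ("ForkEvent", "idan/gist"),
]

# One "interview question": which repos see the most pushes?
pushes = Counter(repo for kind, repo in events if kind == "PushEvent")
top = pushes.most_common(1)
# top -> [("idan/gist", 2)]
```

Each answer tends to suggest the next question (who pushed? when? from where?), which is what makes this stage iterative.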

Part II: Information to Meaning

This talk picks up where part I left off, continuing the journey from filtered, structured information to a coherent visualization. The outline below lists the general points of theory, which are illustrated liberally with examples from our Github visualization project and other relevant experiences.

Finding a good representation for your data.

  • One story or many stories? Static vs. Interactive.
  • Avoiding simple charts and embracing a richer visual language: strategies and examples.
  • Guided vs. free exploration narratives.

The medium

  • Data visualization and the web: a brief history of tools. Static images, Processing/Nodebox, Processing.js, Flash, D3.js.
  • Constraints: the medium’s sweet spot, and audience considerations.
  • The temptation of display-on-hover and good workarounds for conditionally displaying more data on touch platforms.

The backend: serving up your data.

  • Storing your data for access: caching and denormalization.
  • Composing your API resources with your clients in mind: make lazy loading possible, and do everything you can to keep the initial view's data compact and cacheable.
  • Precomputing: because client cycles are cheap but not free.
  • Django + Tastypie
  • JSON is not your only friend: UTFgrid and other creative ways to represent data.
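
To make the precomputing and denormalization ideas concrete, here is a hedged sketch: collapse raw activity rows into one small per-repo summary blob that the initial page view can fetch in a single cacheable request. The data shape and the `precompute_summary` helper are hypothetical:

```python
import json

# Hypothetical raw activity rows: (repo, event_type).
activity = [
    ("idan/gist", "PushEvent"),
    ("idan/gist", "ForkEvent"),
    ("danabauer/viz", "PushEvent"),
]

def precompute_summary(rows):
    """Denormalize raw rows into a compact per-repo summary that the
    initial view can fetch (and a cache can hold) as one small blob."""
    summary = {}
    for repo, kind in rows:
        bucket = summary.setdefault(repo, {"events": 0})
        bucket["events"] += 1
    return summary

# The precomputed blob the backend would serve for the initial view:
payload = json.dumps(precompute_summary(activity), sort_keys=True)
```

Detail views can then lazy-load the full event history per repo, keeping the first paint fast.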

The frontend: displaying your data

  • Marrying data to the DOM: wrapping your brain around D3.js.
  • The difference between visualization toolkits and charting libraries.
  • Multiresolution data: timeseries range-rollups, k-means clustering for maps and other planar data.
  • Tasteful uses of animation
  • Responsive data visualizations: a tough problem. Best practices. Responding to things other than screen size: time, location.
  • The geo stack: a brief tour of mapping technologies and the web. (Optional, if time permits.)
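
One of the multiresolution ideas above, timeseries range-rollups, can be sketched in plain Python: collapse hourly commit counts into daily totals, i.e. one zoom level out. The data shape and function name are assumptions for illustration:

```python
from collections import defaultdict

# Hypothetical hourly commit counts keyed by (day, hour).
hourly = {
    ("2012-09-25", 9): 4,
    ("2012-09-25", 17): 6,
    ("2012-09-26", 11): 3,
}

def roll_up_daily(series):
    """Collapse hourly counts into daily totals -- one zoom level out.
    A real pipeline would precompute several such levels so the client
    never downloads more points than it can usefully draw."""
    daily = defaultdict(int)
    for (day, _hour), count in series.items():
        daily[day] += count
    return dict(daily)

# roll_up_daily(hourly) -> {"2012-09-25": 10, "2012-09-26": 3}
```

The same "serve the resolution the viewport needs" logic applies to k-means-clustered map markers on the planar side.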

Other information:

Idan bio

I'm Django's "Benevolent Designer for Life"; as a member of the core team I'm responsible for issues which touch on the needs of frontend developers as well as anything which can be improved through design.

As a designer/developer hybrid, I've spoken at three DjangoCons, including my keynote address at DjangoCon US 2011. I gave two very well-received talks at the last PyCon, one of which was used as the basis for a new curriculum in high schools in Alaska. I've also spoken numerous times at local Python and web development meetups. I think I'm within bounds to say that I can deliver a fun and engaging experience to the PyCon audience.

My recorded talks and related materials are available on Lanyrd.

Dana bio

I'm a mapmaker and data analyst, with interests in open data and journalism. My background in geography, statistics, and science writing has prepared me well for a job that I love: telling stories with data.

I've given talks on open data, data analysis, data visualization, and web mapping at several conferences, including THATCamp Philly and Esri UC. At PyCon 2011, I teamed up with Jackie Kazil to give a talk called Python for Open Data Lovers: Explore It, Analyze It, Map It.

As part of the PhillyPUG leadership team, I organize and teach workshops and project nights for new coders, with a focus on bringing more women into the Python community. I'm also a co-organizer of Hacks Hackers Philly, a journalist-technologist collaborative.
