@idan
Forked from danabauer/gist:3785664
Created September 26, 2012 22:31

Visualizing Github

A treasure trove of data is captured daily by Github. What stories can that data tell us about how we think, work, and interact? How would one go about finding and telling those stories? This talk is a soup-to-nuts tour of practical data visualization with Python and web technologies, covering both the extraction and display of data to illuminate a familiar dataset.

Detailed Description

In all the time that we have been crafting software, our collective efforts have never been cataloged neatly in one centralized location. Some projects have long been developed in the open, and some have even exposed their development history in one form or another, but the connections between multiple projects remained hidden.

These connections between multiple developers and multiple projects are the glue that binds us together into larger developer communities—they are our mirror, and for the first time we can take a look at ourselves with the aid of the Github API, our favorite dynamic programming language, and standards-based web technologies.

Github provides the perfect case study in the practice of extracting and presenting meaning from data. Come watch us tell a story about telling new stories with a familiar dataset: the tools, the techniques, and the thinking behind our anthropological journey into the largest coding metacommunity.

This talk is being presented in two parts. In part I, Dana will cover the first half of the data visualization process: acquiring the data, cleaning it up, and working with it to tease out a story. In part II, Idan covers the presentational aspects: what to display, how best to display it, and interaction.

Outline

We're trying something new for Pycon: a double feature. Dana (a data scientist) and Idan (a designer/developer hybrid) are teaming up to provide a holistic introduction to data visualization with Python. Each talk covers different aspects of the process, but the two together provide a uniquely comprehensive tour of a hot topic for attendees, accessible for all levels.

This is one of two talks being submitted under the same title, as parts I and II; the division of material between them is described above.

The talk outlines will necessarily focus on principles and “best practices”—but the contents of our talk will explain these principles and practices in the context of how they were applied to our visualization of Github’s data.

Logistical notes

Ideally, these talks will be scheduled back-to-back in the same room.

We'd prefer a 30-minute slot for part I, and a 45-minute slot for part II—however we realize that the logistics are tricky. If two 30-minute slots are all that are available, we'll make it work.

On our respective speaker preferences, we’ve listed that we aren’t interested in giving more than one talk. We’re listing both of us as speakers on both talks, so to clarify: we’re both interested in giving both of these talks, just not in giving these talks and another one (if either of us actually submits an additional talk).

Part I: Data to Information

This talk is a guide to the process of turning raw data into information that can be used to tell a story. Exposing relationships and testing hypotheses is easier today because we have better tools—both for working with the data and for collaborating with others on this process.

Ben Fry defined the seven stages of visualizing data: acquire, parse, filter, mine, represent, refine, interact. In part I, the focus is largely on the first four: the invariably messy process of acquiring data, cleaning it up, and working with it to identify a story.

Acquring, Parsing, Filtering (find a better section title)

  • Start by asking yourself what kind of story you want to tell
  • Identifying the primary data source and finding other data sets that will add context
  • Data rarely comes neatly packaged.
  • The practicalities of getting data out of APIs. A brief tour of the data acquisition toolbox in Python: requests, celery, Beautiful Soup, pyparsing.
  • Tools for cleaning data (not all Python-based)
  • Being a polite data-slurping netizen: optimizing data access through batched queries and respecting rate limits.
  • Turning messy raw data into clean, structured data, or pulling data from one structure into another. The goal: a data structure that we can work with and begin to explore.
  • Throughout this process, I'm usually asking questions of the data and hypothesizing about patterns I might find.
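
As a concrete illustration of the "polite netizen" point, here is a minimal sketch of walking Github's paginated API responses, using only the standard library (the toolbox above favors requests, which works the same way). The helper names `next_page_url` and `fetch_all` are ours, not part of any library; the Link-header format is the one Github actually returns.

```python
import re
import time
import urllib.request

def next_page_url(link_header):
    """Pull the rel="next" URL out of a Github Link header, or None."""
    if not link_header:
        return None
    for part in link_header.split(","):
        m = re.match(r'\s*<([^>]+)>;\s*rel="next"', part)
        if m:
            return m.group(1)
    return None

def fetch_all(url):
    """Follow pagination politely, pausing between requests."""
    pages = []
    while url:
        with urllib.request.urlopen(url) as resp:
            pages.append(resp.read())
            url = next_page_url(resp.headers.get("Link"))
        time.sleep(1)  # don't hammer the API between pages
    return pages
```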

Mining for Stories

  • The fun part: teasing out a story. Who is your audience? What would entertain and enlighten them? This is the essence of journalism: interview your data, and ask questions. It's an iterative process, especially if you're working with clients; as they see initial patterns, they have lots of new questions about the data.
  • IPython Notebook and pandas, plus other non-Python tools for exploratory data analysis. IPython is great for sharing preliminary data analysis and preliminary visualizations with partners, clients, and editors.
  • Storing the data for display: how will data be queried? Does it even make sense to store it all in one kind of database? At scale, your data begins to look a lot like your presentation.
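
To make the "interview your data" idea concrete, here is a tiny, hypothetical pandas session; the commit records below are invented purely for illustration.

```python
import pandas as pd

# Hypothetical commit records, as they might come back from the Github API
commits = pd.DataFrame([
    {"repo": "django",   "author": "idan", "additions": 120},
    {"repo": "django",   "author": "dana", "additions": 45},
    {"repo": "requests", "author": "dana", "additions": 200},
    {"repo": "requests", "author": "idan", "additions": 15},
])

# "Interview" the data: who contributes the most, and where?
by_author = commits.groupby("author")["additions"].sum()
top_repo = commits.groupby("repo")["additions"].sum().idxmax()
```

Each answer tends to raise the next question (why is that repo so active? is it one contributor or many?), which is exactly the iterative loop described above.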

Part II: Information to Meaning

This talk picks up where part I left off, continuing the journey from filtered, structured information to a coherent visualization. The outline below lists the general points of theory, which are illustrated liberally with examples from our Github visualization project and other relevant experiences.

Finding a good representation for your data.

  • One story or many stories? Static vs. Interactive.
  • Avoiding simple charts and embracing a richer visual language: strategies and examples.
  • Guided vs. free exploration narratives.

The medium

  • Data visualization and the web: a brief history of tools. Static images, Processing/Nodebox, Processing.js, Flash, D3.js.
  • Constraints: the medium’s sweet spot, and audience considerations.
  • The temptation of display-on-hover and good workarounds for conditionally displaying more data on touch platforms.

The backend: serving up your data.

  • Storing your data for access: caching and denormalization.
  • Composing your API resources with your clients in mind: make lazy loading possible, and do everything you can to keep the initial view's data compact and cacheable.
  • Precomputing: because client cycles are cheap but not free.
  • Django + Tastypie
  • JSON is not your only friend: UTFgrid and other creative ways to represent data.
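
The precomputing point can be sketched in a few lines: roll the raw event stream up into per-day counts once on the server, and serve the result as a small, cacheable JSON blob instead of shipping raw events to the browser. The event data below is invented for illustration.

```python
import json
from collections import defaultdict

# Hypothetical raw event stream: (day, event type)
events = [
    ("2012-09-24", "push"), ("2012-09-24", "fork"),
    ("2012-09-25", "push"), ("2012-09-25", "push"),
]

# Precompute per-day counts once, server-side, so each page load
# fetches one compact JSON document that caches trivially.
rollup = defaultdict(lambda: defaultdict(int))
for day, kind in events:
    rollup[day][kind] += 1

payload = json.dumps({day: dict(kinds) for day, kinds in sorted(rollup.items())})
```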

The frontend: displaying your data

  • Marrying data to the DOM: wrapping your brain around D3.js.
  • The difference between visualization toolkits and charting libraries.
  • Multiresolution data: timeseries range-rollups, k-means clustering for maps and other planar data.
  • Tasteful uses of animation
  • Responsive data visualizations: a tough problem. Best practices. Responding to things other than screen size: time, location.
  • The geo stack: a brief tour of mapping technologies and the web. (Optional, if time permits.)
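
The k-means bullet above can be sketched with plain Lloyd's algorithm: collapse hundreds of map points into a handful of cluster markers before they ever reach the client. This is an illustrative toy, not production code; a real project would reach for scipy or a spatial index.

```python
import random

def kmeans(points, k, iterations=20, seed=42):
    """Plain Lloyd's algorithm: reduce many (x, y) points to k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                            + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids
```

On a map, the returned centroids become the markers shown at low zoom levels; the raw points only load once the user zooms in.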

Speaker Background

Idan Gazit

I'm Django's "Benevolent Designer for Life"; as a member of the core team I'm responsible for issues which touch on the needs of frontend developers as well as anything which can be improved through design.

As a designer/developer hybrid, I've spoken at three DjangoCons, including my keynote address at DjangoCon US 2011. I gave two very well-received talks at the last PyCon, one of which was used as the basis for a new high-school curriculum in Alaska. I've also spoken numerous times at local Python and web development meetups. I think I'm within bounds to say that I can deliver a fun and engaging experience to the PyCon audience.

My recorded talks and related materials are available on Lanyrd.

Dana Bauer

I'm a mapmaker and data analyst, with interests in open data and journalism. My background in geography, statistics, and science writing has prepared me well for a job that I love: telling stories with data.

I've given talks on open data, data analysis, data visualization, and web mapping at several conferences, including THATCamp Philly and ESRI UC. At PyCon 2011, I teamed up with Jacqueline Kazil to give a talk called “Python for Open Data Lovers: Explore It, Analyze It, Map It.”

As part of the PhillyPUG leadership team, I organize and teach workshops and project nights for new coders, with a focus on bringing more women into the Python community. I'm also a co-organizer of Hacks/Hackers Philly, a journalist-technologist collaborative.
