At Neo4j, we love going to
This year, there is a DYI Schedule page that we are very intrigued by, so let’s build a — graph of the conference!
First, we download the excellent JQ utility, it’s a single binary, in my case I downloaded it to my current directory, ./jq
. Now, the www.oscon.com/oscon2014[OSCON] organizers are so awesome to publish the full schedule as a JSON feed from the OSCON website - thank you! We can parse this directly from the feed via curl
and jq
and pipe the the resulting lines into 3 different CSV files for Neo4j import.
It turns out that the conference has over 350 speakers, 53 venues and 470+ talks!
The data feed can be downloaded and looks something like
{ "Schedule": {
"conferences": [{"serial": 115 }],
"events": [
{
"serial": 33451,
"name": "Migrating to the Web Using Dart and Polymer - A Guide for Legacy OOP Developers",
"event_type": "40-minute conference session",
"time_start": "2014-07-23 17:00:00",
"time_stop": "2014-07-23 17:40:00",
"venue_serial": 1458,
"description": "The web development platform is massive. With tons of libraries, frameworks and concepts out there, it might be daunting for the "legacy" developer to jump into it.\r\n\r\nIn this presentation we will introduce Google Dart & Polymer. Two hot technologies that work in harmony to create powerful web applications using concepts familiar to OOP developers.",
"website_url": "http://oscon.com/oscon2014/public/schedule/detail/33451",
"speakers": [149868],
"categories": [
"Emerging Languages"
]
},
...
We are very pragmatic in this little post and define just three node Labels: Venue
, Speaker
and Event
, following the OSCON data structure:
Now, using the awesome jq
utility, we can easily filter out the relevant bits for our import, most notably the serial numbers of event
, venue
and speaker
which then are cross-referenced in the various parts. We also add a header line to each CSV file for convenience.
#speakers
curl 'http://www.oreilly.com/pub/sc/osconfeed' | ./jq -r '["serial", "name", "photo","url", "twitter"],(.Schedule.speakers[] | [.serial,.name, .photo, .url, .twitter]) | @csv' > speakers.csv
#venues
curl 'http://www.oreilly.com/pub/sc/osconfeed' | ./jq -r '["serial", "name"],(.Schedule.venues[] | [.serial,.name]) | @csv' > venues.csv
#events
curl 'http://www.oreilly.com/pub/sc/osconfeed' | ./jq -r '["serial", "name", "time_start","time_end", "venue_serial", "speakers"],(.Schedule.events[] | [.serial,.name, .time_start, .time_stop, .venue_serial, .speakers | if (. | type) == "null" then "" else (. | tostring | ltrimstr("[") | rtrimstr("]")) end]) | @csv' > events.csv
Resulting in e.g. a speaker.csv
, venues.csv
and events.csv
like that are uploaded for convenience to into this gist and used below with their RAW URLs.
A you can see, in the last case, we have to make sure the empty values, empty arrays and serialization into a parseable string from an integer array are taken care of in jq
, resulting in the gnarly .speakers | if (. | type) == "null" then "" else (. | tostring | ltrimstr("[") | rtrimstr("]")) end]
construct. Still, it’s three one-liners and at least for me impressively compact and readable, resulting in files like event.json:
"serial","name","time_start","time_end","venue_serial","speakers"
"33451","Migrating to the Web Using Dart and Polymer - A Guide for Legacy OOP Developers","2014-07-23 17:00:00","2014-07-23 17:40:00","1458","149868"
...
"34281","Erlang, LFE, Joxa and Elixir: Established and Emerging Languages in the Erlang Ecosystem","2014-07-23 16:10:00","2014-07-23 16:50:00","1456","172990"
"34283","Obey the Testing Goat! TDD for Web Development with Python","2014-07-21 09:00:00","2014-07-21 12:30:00","1450","173004"
"34285","Idioms for Building Distributed Fault-tolerant Applications with Elixir","2014-07-23 17:00:00","2014-07-23 17:40:00","1454","76735"
"34289","Lessons from Girl Develop It: Getting More Women Involved in Open Source","2014-07-23 10:40:00","2014-07-23 11:20:00","1462","169992,173025"
Now we have nicely formatted CSV
files with headers that we can import into Neo4j.
From here, there is built-in support in the standard Neo4j Cypher language for importing files. Loading venues and speakers is just a matter of iterating over the lines and using LOAD CSV:
LOAD CSV WITH headers FROM "https://gist.githubusercontent.com/peterneubauer/818241772095b8c9d41c/raw/9b7c26964277d052f872021a837240995f4927c5/speakers.csv" as line
CREATE (speaker:Speaker{serial:line.serial, name:line.name, photo:line.photo, twitter:line.twitter})
LOAD CSV WITH headers FROM "https://gist.githubusercontent.com/peterneubauer/818241772095b8c9d41c/raw/0d5554825327ad5da7c1ccb30a88625d843a5997/venues.csv" as line
CREATE (speaker:Venue{serial:line.serial, name:line.name})
For the events, it is a bit more involved since one event can have several speakers. We are thus creating one venue
LOAD CSV WITH headers FROM "https://gist.githubusercontent.com/peterneubauer/818241772095b8c9d41c/raw/5e98097170118f185888782b28fd3647f91b2509/events" as line
MERGE (venue: Venue{serial:line.venue_serial})
CREATE (event:Event{serial:line.serial, name:line.name, time_start: line.time_start, time_end: line.time_end})-[:AT_VENUE]->(venue)
WITH event, line
WHERE line.speakers <> ""
FOREACH (speaker_serial in split(line.speakers, ",") |
MERGE (speaker:Speaker{serial:speaker_serial})
CREATE (event)<-[:SPEAKS_AT]-(speaker))
Now we can easily search for some interesting stuff, for instance what talks the speaker Dan Allan is giving by looking up his twitter name:
MATCH (dan:Speaker{twitter:"mojavelinux"})-[:SPEAKS_AT]->(talk)-[:AT_VENUE]->(venue)
RETURN dan, talk, venue