@tommorris
Created June 19, 2009 16:57
I've just put a large chunk of my Twitter archives up on the
Talis Platform service. Talis Platform is a 'cloud'-based
triplestore hosting service. More at http://n2.talis.com
A triplestore is like a database but for graphs of RDF triples.
The cool thing about RDF and the triplestore is that you
basically have a completely schema-less datastore. You don't have
to figure out "Oh, there's integers going in this field and
strings going in that". You just upload a big pile of RDF and the
triplestore keeps it all there. This is obviously not as efficient
as using a database, so if you want to grow to Google size, it may
not be the best solution. But because it's cloud-based I don't
have to think about that either - that's up to Talis! ;)
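To make "schema-less" a bit more concrete, here is a minimal Ruby sketch.
It is a toy, not how Talis stores anything: a triple is just subject,
predicate, object, and the "store" is a pile of them. The match helper and
the http://twitter.com/example account are made up for illustration.

```ruby
# A toy in-memory "triplestore": nothing but an array of
# [subject, predicate, object] triples. Illustrative only; this is
# not how the Talis Platform works internally.
FOAF = "http://xmlns.com/foaf/0.1/"

triples = [
  ["http://twitter.com/tommorris", FOAF + "nick", "tommorris"],
  ["http://twitter.com/tommorris", FOAF + "knows", "http://twitter.com/example"],
  # No schema change is needed to start recording facts about someone new:
  ["http://twitter.com/example", FOAF + "nick", "example"]
]

# Hypothetical helper: nil in a position acts as a wildcard.
def match(triples, s = nil, p = nil, o = nil)
  triples.select do |ts, tp, to|
    (s.nil? || s == ts) && (p.nil? || p == tp) && (o.nil? || o == to)
  end
end

# Every foaf:nick in the pile, whoever it belongs to.
nicks = match(triples, nil, FOAF + "nick").map { |t| t[2] }
puts nicks.inspect
```

A real triplestore adds indexing and a query language (SPARQL) on top, but
the data model is no more than this.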
Turning Twitter data into RDF is pretty easy. The approach I found
easiest was to use the API which returns either XML or JSON.
I used XML as I have already got an XSLT stylesheet that does most
of the work.
Pre-requisites:
* a Unix-based OS
* curl
* xsltproc
* Ruby 1.8.6+ (or JRuby 1.3.0)
* nokogiri gem
I had old archive data from Twitter, from back when the old archive method
was in use. In that data, tweets that are at-replies to other tweets only
have the ID of the other user, not the screen name. But the URI of a tweet
is constructed from the screen name, so you need to look up the screen name
from the ID using the /users/show.xml?user_id=(val) method. The code to do
that is in transform.rb
transform.rb is a bit of a lazy hack. If you run it over old archive
data, it WILL crash. That's because open-uri raises an exception when
it gets a 404 status. Silly really, as 404 is a perfectly valid status,
and is semantically meaningful. <http://twitter.com/tommorris>
returning 404 means there is no @tommorris on Twitter. ;)
When it hit a 404, I took whatever number it returned and manually
grepped for it in the file, figured out who the at-reply was to and
then added that person's details to the YAML file.
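If you would rather not crash, one option is to rescue the exception and
treat a 404 as "no such user". This is only a sketch and is not what
transform.rb does; the nil_on_404 helper name is made up. It relies on
OpenURI::HTTPError exposing the response through its io attribute.

```ruby
require "open-uri"

# Run the block (which is expected to call open-uri) and return its
# result, or nil if the server answered 404. Any other HTTP error is
# re-raised. The helper name is hypothetical.
def nil_on_404
  yield
rescue OpenURI::HTTPError => e
  raise unless e.io.status[0] == "404"
  nil
end

# Usage sketch (needs network access, so left commented out):
#   xml = nil_on_404 { open("http://twitter.com/users/show.xml?user_id=#{val}").read }
#   puts "no such user: #{val}" if xml.nil?
```

The caller still has to decide what a nil means; for the archive data it
would probably mean leaving the in_reply_to_screen_name element out.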
The XSLT used is below, but if you want to do this yourself I recommend
waiting a few days. I'm planning on rewriting the XSLT a bit soon to
make it suck less. The code is in twitter-rdf.xsl
As for actually doing the transformations and loading them into the
Talis store, I used IRB (interactive Ruby shell) to invoke xsltproc
and curl.
irb> `ls *.xml`.split.each{|i| `xsltproc ~/Code/twitter-rdf.xsl #{i} > #{i.split('.')[0] + ".rdf"}` }
irb> `ls *.rdf`.split.each{|i| `curl -v --digest -u "(username):(password)" --retry 10 --retry-delay 10 -H "Content-Type:application/rdf+xml" --data @#{i} http://api.talis.com/stores/(storename)/meta` }
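The same two steps can also be written as a small script rather than typed
into IRB. The sketch below builds each command as an argument array, so
file names never go through a shell. The helper names are made up, the
flags are the ones used above plus xsltproc's -o output option, and
(storename), (username) and (password) remain placeholders you fill in.

```ruby
# Hypothetical helpers that build the xsltproc and curl invocations
# as argument arrays suitable for Kernel#system.

# "12.xml" -> "12.rdf"
def rdf_name(xml_file)
  xml_file.sub(/\.xml\z/, ".rdf")
end

# xsltproc writes to the file given with -o instead of stdout.
def xsltproc_cmd(stylesheet, xml_file)
  ["xsltproc", "-o", rdf_name(xml_file), stylesheet, xml_file]
end

# Same curl flags as the one-liner above.
def curl_cmd(rdf_file, store, user, password)
  ["curl", "-v", "--digest", "-u", "#{user}:#{password}",
   "--retry", "10", "--retry-delay", "10",
   "-H", "Content-Type:application/rdf+xml",
   "--data", "@#{rdf_file}",
   "http://api.talis.com/stores/#{store}/meta"]
end

# Usage sketch (runs the external tools, so left commented out):
#   xsl = File.expand_path("~/Code/twitter-rdf.xsl")
#   Dir.glob("*.xml").sort.each { |f| system(*xsltproc_cmd(xsl, f)) }
#   Dir.glob("*.rdf").sort.each { |f| system(*curl_cmd(f, "(storename)", "(username)", "(password)")) }
```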
# transform.rb
require "rubygems"
require "nokogiri"
require "open-uri"
require "yaml"

# Return the screen name for a user ID, asking Twitter only when the
# ID is not already in the hash loaded from the YAML file.
def username_lookup(val, hash)
  if hash[val].nil?
    print "-- #{val}"
    screenname = Nokogiri::XML(open("http://twitter.com/users/show.xml?user_id=#{val}").readlines.join).search("screen_name")[0].content.to_s
    print " = #{screenname}\n"
    hash[val] = screenname
    sleep 30 # so as not to exceed the Twitter API limit
  end
  return hash[val]
end
# Walk the 76 archive pages, adding in_reply_to_screen_name where it is missing.
(1..76).each do |f|
  hash = YAML::load_file("/home/tom/twitter_usernames.yml")
  puts "Processing #{f}.xml"
  origarchive = Nokogiri::XML(open("/home/tom/twitter_archive/#{f}.xml").readlines.join)
  origarchive.search("status").each do |status|
    reply_id = status.search("in_reply_to_user_id")[0]
    # Only touch replies that carry a user ID but no screen name yet.
    next if reply_id.content == "" || status.search("in_reply_to_screen_name").size > 0
    newnode = Nokogiri::XML::Node.new("in_reply_to_screen_name", origarchive)
    newnode.content = username_lookup(reply_id.content.to_s, hash)
    reply_id.add_next_sibling(newnode)
  end
  # The block form closes the file once the archive has been written back.
  File.open("/home/tom/twitter_archive/#{f}.xml", "w") { |out| origarchive.root.write_to(out) }
  File.open("/home/tom/twitter_usernames.yml", "w") do |out|
    YAML.dump(hash, out)
  end
end
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:twitter="http://rdf.opiumfield.com/twitter/0.1/"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
<xsl:output method="xml" indent="yes" encoding="UTF-8" />
<xsl:template match="text()"/>
<xsl:param name="username"/>
<xsl:template match="users">
<rdf:RDF>
<rdf:Description rdf:about="">
<foaf:primaryTopic rdf:resource="http://twitter.com/{$username}"/>
</rdf:Description>
<foaf:Agent rdf:about="http://twitter.com/{$username}">
<xsl:apply-templates select="user" mode="link"/>
<rdfs:seeAlso rdf:resource="http://tools.opiumfield.com/twitter/{$username}/rdf/history" />
</foaf:Agent>
<xsl:apply-templates select="user" mode="details"/>
</rdf:RDF>
</xsl:template>
<xsl:template match="user" mode="link">
<foaf:knows rdf:resource="http://twitter.com/{screen_name}"/>
</xsl:template>
<xsl:template match="user" mode="details">
<foaf:Agent rdf:about="http://twitter.com/{screen_name}">
<foaf:nick>
<xsl:value-of select="screen_name"/>
</foaf:nick>
<foaf:name>
<xsl:value-of select="name"/>
</foaf:name>
<xsl:if test="string-length(url) &gt; 0">
<foaf:homepage rdf:resource="{url}"/>
</xsl:if>
<xsl:if test="status">
<foaf:made rdf:resource="http://twitter.com/{screen_name}/statuses/{status/id}" />
</xsl:if>
<rdfs:seeAlso rdf:resource="http://tools.opiumfield.com/twitter/{screen_name}/rdf"/>
<rdfs:seeAlso rdf:resource="http://tools.opiumfield.com/twitter/{screen_name}/rdf/history" />
</foaf:Agent>
<xsl:if test="status">
<xsl:apply-templates select="status" />
</xsl:if>
</xsl:template>
<xsl:template match="statuses[@type='array']">
<rdf:RDF>
<xsl:apply-templates select="status" />
</rdf:RDF>
</xsl:template>
<xsl:template match="status">
<xsl:variable name="screen_name">
<xsl:choose>
<xsl:when test="../screen_name">
<xsl:value-of select="../screen_name" />
</xsl:when>
<xsl:when test="user/screen_name">
<xsl:value-of select="user/screen_name" />
</xsl:when>
</xsl:choose>
</xsl:variable>
<sioc:Post rdf:about="http://twitter.com/{$screen_name}/statuses/{id}">
<rdf:type rdf:resource="http://rdfs.org/sioc/types#MicroblogPost" />
<sioc:content xml:lang="en">
<xsl:value-of select="text"/>
</sioc:content>
<xsl:if test="truncated/text() = 'true'">
<twitter:truncated rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</twitter:truncated>
</xsl:if>
<xsl:choose>
<xsl:when test="in_reply_to_screen_name/text() != '' and in_reply_to_status_id/text() != ''">
<sioc:reply_to>
<sioc:Post rdf:about="http://twitter.com/{normalize-space(in_reply_to_screen_name/text())}/statuses/{normalize-space(in_reply_to_status_id/text())}">
<foaf:maker>
<foaf:Agent>
<foaf:weblog rdf:resource="http://twitter.com/{normalize-space(in_reply_to_screen_name/text())}" />
</foaf:Agent>
</foaf:maker>
</sioc:Post>
</sioc:reply_to>
</xsl:when>
<xsl:when test="in_reply_to_screen_name/text() != ''">
<sioc:reply_to>
<rdf:Description>
<foaf:maker>
<foaf:Agent>
<foaf:weblog rdf:resource="http://twitter.com/{normalize-space(in_reply_to_screen_name/text())}" />
</foaf:Agent>
</foaf:maker>
</rdf:Description>
</sioc:reply_to>
</xsl:when>
</xsl:choose>
<dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">
<xsl:value-of select="substring(created_at, 27, 4)"/>
<xsl:text>-</xsl:text>
<xsl:if test="substring(created_at, 5, 3) = 'Jan'">
<xsl:text>01</xsl:text>
</xsl:if>
<xsl:if test="substring(created_at, 5, 3) = 'Feb'">
<xsl:text>02</xsl:text>
</xsl:if>
<xsl:if test="substring(created_at, 5, 3) = 'Mar'">
<xsl:text>03</xsl:text>
</xsl:if>
<xsl:if test="substring(created_at, 5, 3) = 'Apr'">
<xsl:text>04</xsl:text>
</xsl:if>
<xsl:if test="substring(created_at, 5, 3) = 'May'">
<xsl:text>05</xsl:text>
</xsl:if>
<xsl:if test="substring(created_at, 5, 3) = 'Jun'">
<xsl:text>06</xsl:text>
</xsl:if>
<xsl:if test="substring(created_at, 5, 3) = 'Jul'">
<xsl:text>07</xsl:text>
</xsl:if>
<xsl:if test="substring(created_at, 5, 3) = 'Aug'">
<xsl:text>08</xsl:text>
</xsl:if>
<xsl:if test="substring(created_at, 5, 3) = 'Sep'">
<xsl:text>09</xsl:text>
</xsl:if>
<xsl:if test="substring(created_at, 5, 3) = 'Oct'">
<xsl:text>10</xsl:text>
</xsl:if>
<xsl:if test="substring(created_at, 5, 3) = 'Nov'">
<xsl:text>11</xsl:text>
</xsl:if>
<xsl:if test="substring(created_at, 5, 3) = 'Dec'">
<xsl:text>12</xsl:text>
</xsl:if>
<xsl:text>-</xsl:text>
<xsl:value-of select="substring(created_at, 9, 2)"/>
<xsl:text>T</xsl:text>
<xsl:value-of select="substring(created_at, 12, 8)"/>
<xsl:text>Z</xsl:text>
</dcterms:created>
<dcterms:source rdf:resource="http://twitter.com/{$screen_name}"/>
<foaf:maker>
<foaf:Agent>
<foaf:weblog rdf:resource="http://twitter.com/{$screen_name}"/>
<xsl:if test="user/url/text() != ''">
<foaf:homepage rdf:resource="{user/url}" />
</xsl:if>
</foaf:Agent>
</foaf:maker>
</sioc:Post>
</xsl:template>
</xsl:stylesheet>
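The dcterms:created part of the stylesheet is all substring() arithmetic,
which is easy to get wrong by one. As a cross-check, here is the same slicing
as a Ruby sketch. It assumes created_at looks like
"Fri Jun 19 16:57:00 +0000 2009", which is the layout the offsets imply; the
function name and the sample timestamp are made up.

```ruby
MONTHS = %w[Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec]

# Mirror of the stylesheet's substring() calls. XPath substring()
# counts from 1 and Ruby's String#[] from 0, so every offset here is
# one lower than in the XSLT.
def created_at_to_xsd(created_at)
  year  = created_at[26, 4]                       # substring(created_at, 27, 4)
  month = MONTHS.index(created_at[4, 3]) + 1      # substring(created_at, 5, 3)
  day   = created_at[8, 2]                        # substring(created_at, 9, 2)
  time  = created_at[11, 8]                       # substring(created_at, 12, 8)
  format("%s-%02d-%sT%sZ", year, month, day, time)
end

puts created_at_to_xsd("Fri Jun 19 16:57:00 +0000 2009")
# prints 2009-06-19T16:57:00Z
```

Like the stylesheet, this ignores the +0000 field and always appends Z, so
it is only correct while the timestamps are in UTC.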