tommorris (owner)

Public Clone URL: git://gist.github.com/132729.git
README #
I've just put a large chunk of my Twitter archives up on the
Talis Platform service. Talis Platform is a 'cloud'-based
triplestore hosting service. More at http://n2.talis.com
A triplestore is like a database but for graphs of RDF triples.
The cool thing about RDF and the triplestore is that you
basically have a completely schema-less datastore. You don't have
to figure out "Oh, there's integers going in this field and
strings going in that". You just upload a big pile of RDF and the
triplestore keeps it all there. This is obviously not as efficient
as using a database, so if you want to grow to Google size, it may
not be the best solution. But because it's cloud-based I don't
have to think about that either - that's up to Talis! ;)
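 
To make the schema-less point concrete, here is a toy Ruby sketch. It
is not Talis Platform code: the store is just an in-memory array, and
the `match` helper and the sample data are made up for illustration.
A nil in the pattern acts as a wildcard.

```ruby
# A toy in-memory "triplestore": an array of [subject, predicate, object].
# No schema is declared anywhere; any predicate can hang off any subject,
# and objects can be strings, URIs or booleans side by side.
TRIPLES = [
  ["http://twitter.com/example_user", "foaf:nick", "example_user"],
  ["http://twitter.com/example_user", "foaf:knows", "http://twitter.com/other_user"],
  ["http://twitter.com/example_user/statuses/1", "sioc:content", "Hello"],
  ["http://twitter.com/example_user/statuses/1", "twitter:truncated", false]
]

# Return every triple that matches the pattern; nil means "anything".
def match(triples, subject = nil, predicate = nil, object = nil)
  triples.select do |s, p, o|
    (subject.nil? || s == subject) &&
      (predicate.nil? || p == predicate) &&
      (object.nil? || o == object)
  end
end

# Everything known about one subject, whatever the predicates are.
match(TRIPLES, "http://twitter.com/example_user").each { |t| puts t.inspect }
```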
 
Turning Twitter data into RDF is pretty easy. The approach I found
easiest was to use the API which returns either XML or JSON.
I used XML as I have already got an XSLT stylesheet that does most
of the work.
 
Pre-requisites:
* a Unix-based OS
* curl
* xsltproc
* Ruby 1.8.6+ (or JRuby 1.3.0)
  * nokogiri gem
 
I had old archive data from Twitter, fetched using the old archive
method. In that data, tweets that are at-replies to other tweets only
have the ID of the other user, not the screen name. But the URI of a
tweet is constructed from the screen name, so you need to look up the
ID using the /users/show.xml?user_id=(val) method to get the screen
name. The code to do that is in transform.rb
 
transform.rb is a bit of a lazy hack. If you run it over old archive
data, it WILL crash. That's because open-uri raises an exception when
it gets a 404 status. Silly really, as 404 is a perfectly valid status,
and is semantically meaningful. <http://twitter.com/tommorris>
returning 404 means there is no @tommorris on Twitter. ;)
When it hit a 404, I took whatever ID it had printed and manually
grepped for it in the file, figured out who the at-reply was to and
then added that person's details to the YAML file.
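 
One way to avoid the crash would be to rescue the exception. This is a
sketch only, not what transform.rb does: `safe_screen_name` and the
`fetcher` argument are hypothetical names, and the regex stands in for
the Nokogiri parsing so the example has no gem dependencies. The
lookup returns nil on an HTTP error so the caller can skip that reply
and carry on.

```ruby
require "open-uri"
require "stringio"

# Look up the screen name for a user ID, returning nil instead of raising
# when the server answers with an HTTP error such as 404.
# `fetcher` is any callable that takes a URL and returns the body as a String.
def safe_screen_name(user_id, fetcher)
  xml = fetcher.call("http://twitter.com/users/show.xml?user_id=#{user_id}")
  # Pull out the first <screen_name> element; nil if there is none.
  xml[%r{<screen_name>([^<]*)</screen_name>}, 1]
rescue OpenURI::HTTPError => e
  warn "-- #{user_id}: lookup failed (#{e.message})"
  nil
end

# Stub fetchers so the sketch runs without touching the network.
found   = lambda { |_url| "<user><screen_name>tommorris</screen_name></user>" }
missing = lambda { |_url| raise OpenURI::HTTPError.new("404 Not Found", StringIO.new) }

p safe_screen_name("12345", found)
p safe_screen_name("67890", missing)
```

For real use, the fetcher would wrap the same open-uri call that
transform.rb already makes.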
 
The XSLT used is below, but if you want to do this yourself I
recommend waiting a few days. I'm planning on rewriting the XSLT a
bit soon to make it suck less. The code is in twitter-rdf.xsl
 
As for actually doing the transformations and loading them into the
Talis store, I used IRB (interactive Ruby shell) to invoke xsltproc
and curl.
 
irb> `ls *.xml`.split.each{|i| `xsltproc ~/Code/twitter-rdf.xsl #{i} > #{i.split('.')[0] + ".rdf"}` }
irb> `ls *.rdf`.split.each{|i| `curl -v --digest -u "(username):(password)" --retry 10 --retry-delay 10 -H "Content-Type:application/rdf+xml" --data @#{i} http://api.talis.com/stores/(storename)/meta` }
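 
The same two steps could also live in a small script. The sketch below
is an alternative to the IRB one-liners, not what I actually ran: the
helper names are made up, and it uses xsltproc's -o option rather than
a shell redirect. The commands are built as argument arrays so they
can be handed to system(*cmd) without going through a shell.

```ruby
# Build the xsltproc command for one archive file, e.g. 12.xml -> 12.rdf.
def xslt_command(stylesheet, xml_path)
  rdf_path = xml_path.sub(/\.xml\z/, "") + ".rdf"
  ["xsltproc", "-o", rdf_path, stylesheet, xml_path]
end

# Build the curl command that posts one RDF file to the store's metabox.
# The options are the same ones used in the IRB line above.
def upload_command(rdf_path, user, password, store)
  ["curl", "-v", "--digest", "-u", "#{user}:#{password}",
   "--retry", "10", "--retry-delay", "10",
   "-H", "Content-Type:application/rdf+xml",
   "--data", "@#{rdf_path}",
   "http://api.talis.com/stores/#{store}/meta"]
end

# Transform every .xml file in dir, then upload every resulting .rdf file.
# Stops at the first command that fails.
def transform_and_upload(dir, stylesheet, user, password, store)
  Dir.glob(File.join(dir, "*.xml")).sort.each do |xml|
    system(*xslt_command(stylesheet, xml)) or abort("xsltproc failed on #{xml}")
  end
  Dir.glob(File.join(dir, "*.rdf")).sort.each do |rdf|
    system(*upload_command(rdf, user, password, store)) or abort("curl failed on #{rdf}")
  end
end
```

Call transform_and_upload with the archive directory, the path to
twitter-rdf.xsl and your own store name and credentials.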
transform.rb #
require "rubygems"
require "nokogiri"
require "open-uri"
require "yaml"
 
def username_lookup(val, hash)
  if hash[val].nil?
    print "-- #{val}"
    screenname = Nokogiri::XML(open("http://twitter.com/users/show.xml?user_id=#{val}").readlines.join).search("screen_name")[0].content.to_s
    print " = #{screenname}\n"
    hash[val] = screenname
    sleep 30 # so as not to exceed the Twitter API limit
  end
  return hash[val]
end
 
(1..76).to_a.each do |f|
  hash = YAML::load_file("/home/tom/twitter_usernames.yml")
  puts "Processing #{f}.xml"
  origarchive = Nokogiri::XML(open("/home/tom/twitter_archive/#{f.to_s}.xml").readlines.join)
  # Find replies that carry a user ID but no screen name yet, look the
  # screen name up and insert it right after the in_reply_to_user_id node.
  origarchive.search("status").each do |status|
    reply_id = status.search("in_reply_to_user_id")[0]
    next if reply_id.content == "" || status.search("in_reply_to_screen_name").size > 0
    newnode = Nokogiri::XML::Node.new("in_reply_to_screen_name", origarchive)
    newnode.content = username_lookup(reply_id.content.to_s, hash)
    reply_id.add_next_sibling(newnode)
  end
  # Use the block form so the file handle is flushed and closed.
  File.open("/home/tom/twitter_archive/#{f.to_s}.xml", "w") do |out|
    origarchive.root.write_to(out)
  end
  File.open("/home/tom/twitter_usernames.yml", "w") do |out|
    YAML.dump(hash, out)
  end
end
twitter-rdf.xsl #
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:twitter="http://rdf.opiumfield.com/twitter/0.1/"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
    <xsl:output method="xml" indent="yes" encoding="UTF-8" />
    <xsl:template match="text()"/>
    <xsl:param name="username"/>
    <xsl:template match="users">
        <rdf:RDF>
            <rdf:Description rdf:about="">
                <foaf:primaryTopic rdf:resource="http://twitter.com/{$username}"/>
            </rdf:Description>
            <foaf:Agent rdf:about="http://twitter.com/{$username}">
                <xsl:apply-templates select="user" mode="link"/>
                <rdfs:seeAlso rdf:resource="http://tools.opiumfield.com/twitter/{$username}/rdf/history" />
            </foaf:Agent>
            <xsl:apply-templates select="user" mode="details"/>
        </rdf:RDF>
    </xsl:template>
    <xsl:template match="user" mode="link">
        <foaf:knows rdf:resource="http://twitter.com/{screen_name}"/>
    </xsl:template>
    <xsl:template match="user" mode="details">
        <foaf:Agent rdf:about="http://twitter.com/{screen_name}">
            <foaf:nick>
                <xsl:value-of select="screen_name"/>
            </foaf:nick>
            <foaf:name>
                <xsl:value-of select="name"/>
            </foaf:name>
            <xsl:if test="string-length(url) &gt; 0">
                <foaf:homepage rdf:resource="{url}"/>
            </xsl:if>
            <xsl:if test="status">
                <foaf:made rdf:resource="http://twitter.com/{screen_name}/statuses/{status/id}" />
            </xsl:if>
            <rdfs:seeAlso rdf:resource="http://tools.opiumfield.com/twitter/{screen_name}/rdf"/>
            <rdfs:seeAlso rdf:resource="http://tools.opiumfield.com/twitter/{screen_name}/rdf/history" />
        </foaf:Agent>
        <xsl:if test="status">
            <xsl:apply-templates select="status" />
        </xsl:if>
    </xsl:template>
    <xsl:template match="statuses[@type='array']">
        <rdf:RDF>
            <xsl:apply-templates select="status" />
        </rdf:RDF>
    </xsl:template>
    <xsl:template match="status">
        <xsl:variable name="screen_name">
            <xsl:choose>
                <xsl:when test="../screen_name">
                    <xsl:value-of select="../screen_name" />
                </xsl:when>
                <xsl:when test="user/screen_name">
                    <xsl:value-of select="user/screen_name" />
                </xsl:when>
            </xsl:choose>
        </xsl:variable>
        <sioc:Post rdf:about="http://twitter.com/{$screen_name}/statuses/{id}">
            <rdf:type rdf:resource="http://rdfs.org/sioc/types#MicroblogPost" />
            <sioc:content xml:lang="en">
                <xsl:value-of select="text"/>
            </sioc:content>
            <xsl:if test="truncated/text() = 'true'">
                <twitter:truncated rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</twitter:truncated>
            </xsl:if>
        <xsl:choose>
            <xsl:when test="in_reply_to_screen_name/text() != '' and in_reply_to_status_id/text() != ''">
                <sioc:reply_to>
                    <sioc:Post rdf:about="http://twitter.com/{normalize-space(in_reply_to_screen_name/text())}/statuses/{normalize-space(in_reply_to_status_id/text())}">
                        <foaf:maker>
                            <foaf:Agent>
                                <foaf:weblog rdf:resource="http://twitter.com/{normalize-space(in_reply_to_screen_name/text())}" />
                            </foaf:Agent>
                        </foaf:maker>
                    </sioc:Post>
                </sioc:reply_to>
            </xsl:when>
            <xsl:when test="in_reply_to_screen_name/text() != ''">
                <sioc:reply_to>
                    <rdf:Description>
                        <foaf:maker>
                            <foaf:Agent>
                                <foaf:weblog rdf:resource="http://twitter.com/{normalize-space(in_reply_to_screen_name/text())}" />
                            </foaf:Agent>
                        </foaf:maker>
                    </rdf:Description>
                </sioc:reply_to>
            </xsl:when>
        </xsl:choose>
            <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">
                <xsl:value-of select="substring(created_at, 27, 4)"/>
                <xsl:text>-</xsl:text>
                <xsl:if test="substring(created_at, 5, 3) = 'Jan'">
                    <xsl:text>01</xsl:text>
                </xsl:if>
                <xsl:if test="substring(created_at, 5, 3) = 'Feb'">
                    <xsl:text>02</xsl:text>
                </xsl:if>
                <xsl:if test="substring(created_at, 5, 3) = 'Mar'">
                    <xsl:text>03</xsl:text>
                </xsl:if>
                <xsl:if test="substring(created_at, 5, 3) = 'Apr'">
                    <xsl:text>04</xsl:text>
                </xsl:if>
                <xsl:if test="substring(created_at, 5, 3) = 'May'">
                    <xsl:text>05</xsl:text>
                </xsl:if>
                <xsl:if test="substring(created_at, 5, 3) = 'Jun'">
                    <xsl:text>06</xsl:text>
                </xsl:if>
                <xsl:if test="substring(created_at, 5, 3) = 'Jul'">
                    <xsl:text>07</xsl:text>
                </xsl:if>
                <xsl:if test="substring(created_at, 5, 3) = 'Aug'">
                    <xsl:text>08</xsl:text>
                </xsl:if>
                <xsl:if test="substring(created_at, 5, 3) = 'Sep'">
                    <xsl:text>09</xsl:text>
                </xsl:if>
                <xsl:if test="substring(created_at, 5, 3) = 'Oct'">
                    <xsl:text>10</xsl:text>
                </xsl:if>
                <xsl:if test="substring(created_at, 5, 3) = 'Nov'">
                    <xsl:text>11</xsl:text>
                </xsl:if>
                <xsl:if test="substring(created_at, 5, 3) = 'Dec'">
                    <xsl:text>12</xsl:text>
                </xsl:if>
                <xsl:text>-</xsl:text>
                <xsl:value-of select="substring(created_at, 9, 2)"/>
                <xsl:text>T</xsl:text>
                <xsl:value-of select="substring(created_at, 12, 8)"/>
                <xsl:text>Z</xsl:text>
            </dcterms:created>
            <dcterms:source rdf:resource="http://twitter.com/{$screen_name}"/>
            <foaf:maker>
                <foaf:Agent>
                    <foaf:weblog rdf:resource="http://twitter.com/{$screen_name}"/>
                    <xsl:if test="user/url/text() != ''">
                        <foaf:homepage rdf:resource="{user/url}" />
                    </xsl:if>
                </foaf:Agent>
            </foaf:maker>
        </sioc:Post>
    </xsl:template>
</xsl:stylesheet>
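
The dcterms:created logic in the stylesheet above is easier to follow
outside XSLT. Here is a rough Ruby equivalent that uses the same
character offsets (XPath's substring() counts from 1, Ruby from 0).
The function name and the sample timestamp are made up. Like the
stylesheet, it ignores the UTC offset field and always appends "Z", so
it is only right when created_at is already in UTC (+0000). Unlike the
stylesheet, which just emits no month, it raises on a month name it
does not recognise.

```ruby
MONTHS = %w[Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec]

# Convert a created_at value such as "Mon Jun 22 07:37:47 +0000 2009"
# into an xsd:dateTime string, using fixed offsets like the XSLT does.
def created_at_to_xsd(created_at)
  year  = created_at[26, 4]                      # substring(created_at, 27, 4)
  month = MONTHS.index(created_at[4, 3]) + 1     # substring(created_at, 5, 3)
  day   = created_at[8, 2]                       # substring(created_at, 9, 2)
  time  = created_at[11, 8]                      # substring(created_at, 12, 8)
  format("%s-%02d-%sT%sZ", year, month, day, time)
end

puts created_at_to_xsd("Mon Jun 22 07:37:47 +0000 2009")
# prints 2009-06-22T07:37:47Z
```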