Created
April 8, 2010 17:37
Counting incoming links in DBpedia with unix shell tools
# Sample single-pipeline Unix shell command to estimate the 10 most popular DBpedia resources
# by counting incoming links between matching Wikipedia articles
% time curl http://downloads.dbpedia.org/3.5/en/page_links_en.nt.bz2 \
| bzcat \
| head -1000000 \
| sed -e 's/.*\/\(.*\)> \./\1/' \
| sort \
| uniq -c \
| sort -nr \
| head -10
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 833M 0 6518k 0 0 104k 0 2:16:26 0:01:02 2:15:24 115k
2302 United_States
1174 France
1122 United_Kingdom
1069 Germany
1013 World_War_II
750 England
744 Italy
647 Russia
635 Spain
634 Canada
curl http://downloads.dbpedia.org/3.5/en/page_links_en.nt.bz2 0.02s user 0.04s system 0% cpu 1:03.48 total
bzcat 6.66s user 0.12s system 10% cpu 1:03.46 total
head -1000000 0.07s user 0.12s system 0% cpu 1:03.46 total
sed -e 's/.*\/\(.*\)> \./\1/' 56.28s user 1.25s system 90% cpu 1:03.49 total
sort 5.66s user 0.09s system 8% cpu 1:05.91 total
uniq -c 0.48s user 0.02s system 0% cpu 1:05.91 total
sort -nr 2.33s user 0.03s system 3% cpu 1:08.24 total
head -10 0.00s user 0.00s system 0% cpu 1:08.24 total
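The timings above show the sed stage using most of the CPU time. As a sketch only, the same extraction could be written with awk by taking the last path segment of the third field. This is a swapped-in technique, not part of the original command, and it has not been timed here. The helper name `extract_target`, the sample line and its predicate URI are hypothetical.

```shell
# Hypothetical helper: print the last path segment of the object URI (third field).
# Assumes subject, predicate and object are separated by whitespace and that
# the object is a URI containing no spaces.
extract_target() {
  awk '{
    n = split($3, parts, "/")   # split the object URI on "/"
    name = parts[n]             # keep the last segment, e.g. "France>"
    sub(/>$/, "", name)         # drop the closing angle bracket
    print name
  }'
}

# Made-up sample triple; the predicate URI is a placeholder.
printf '%s\n' '<http://dbpedia.org/resource/Paris> <http://example.org/link> <http://dbpedia.org/resource/France> .' \
  | extract_target
```

On this sample line the helper prints `France`, matching what the sed expression produces for the same input.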
# Explanations:
# curl: fetch triples: <from resource> <link> <to resource> .
# bzcat: decompress the bzip2 payload
# head: restrict the download to the first 10^6 (1,000,000) links (good enough for an estimate)
# sed: extract the "to" resource name (and drop the rest)
# sort: group resources alphabetically so that identical names become adjacent
# uniq: count consecutive incoming links to the same resource
# sort: order by incoming link count, highest first
# head: only display the top 10 resources
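To try the counting stages without downloading the dump, the sketch below runs the same sed, sort, uniq, sort, head chain on a few inline triples. The function names `sample_triples` and `top_linked`, the sample links and the predicate URI are made up for illustration; only the line shape follows the triples described above.

```shell
# Hypothetical sample in the same shape as the triples described above
# (<from resource> <link> <to resource> .); the links are invented.
sample_triples() {
  cat <<'EOF'
<http://dbpedia.org/resource/Paris> <http://example.org/link> <http://dbpedia.org/resource/France> .
<http://dbpedia.org/resource/Berlin> <http://example.org/link> <http://dbpedia.org/resource/Germany> .
<http://dbpedia.org/resource/Lyon> <http://example.org/link> <http://dbpedia.org/resource/France> .
<http://dbpedia.org/resource/Madrid> <http://example.org/link> <http://dbpedia.org/resource/Spain> .
<http://dbpedia.org/resource/Munich> <http://example.org/link> <http://dbpedia.org/resource/Germany> .
<http://dbpedia.org/resource/Nice> <http://example.org/link> <http://dbpedia.org/resource/France> .
EOF
}

# Same counting stages as the original pipeline, minus download and decompression.
# $1 is the number of top resources to print (defaults to 10).
top_linked() {
  sed -e 's/.*\/\(.*\)> \./\1/' \
    | sort \
    | uniq -c \
    | sort -nr \
    | head -n "${1:-10}"
}

sample_triples | top_linked 2
```

With this sample the two lines printed are a count of 3 for France followed by a count of 2 for Germany; the width of the padding before each count depends on the local `uniq` implementation.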