@sebastian-nagel
Last active September 28, 2020 13:38
webgraph commands
### Jython
# install Jython (see https://www.jython.org/download)
wget https://repo1.maven.org/maven2/org/python/jython-standalone/2.7.2/jython-standalone-2.7.2.jar
# clone pywebgraph (fork with modifications)
git clone https://github.com/commoncrawl/py-web-graph.git
cd py-web-graph
# copy console.py into current working directory so that "pywebgraph" is visible as package
cp pywebgraph/console.py .
# $WG_CP must hold the class path for the webgraph package,
# cf. the script run_webgraph.sh
# Note: use `;' as separator on Windows
DIR=$PWD # or modify if webgraph is installed in a different location
WG_CP=$DIR/webgraph-$WEBGRAPH_VERSION.jar:$(ls $DIR/deps/*.jar | tr '\n' ':')
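# A minimal sketch of another way to join the dependency jars, assuming the same
# layout as above (dependencies in $DIR/deps). The function name join_jars is
# hypothetical. Unlike the `ls | tr' pipeline it copes with spaces in paths and
# leaves no trailing separator.
```shell
join_jars() {
    local dir=$1 cp= jar
    for jar in "$dir"/*.jar; do
        [ -e "$jar" ] || continue      # skip the pattern if it matched nothing
        cp=${cp:+$cp:}$jar             # put ':' only between entries
    done
    printf '%s\n' "$cp"
}
# usage: WG_CP=$DIR/webgraph-$WEBGRAPH_VERSION.jar:$(join_jars "$DIR/deps")
```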
# Note:
# - need enough Java heap space (-Xmx...) to load large graphs
# - adapt path pointing to Python executable
java -Xmx12g -Dpython.console=org.python.util.JLineConsole -Dpython.executable=/bin/python2.7 -cp $WG_CP:../jython-standalone-2.7.2.jar: org.python.util.jython console.py
pyWebGraph console, Copyright (C) 2009 Massimo Santini
>>
>> graph .../cc-main-2020-feb-mar-may-domain
>> pwn
#0
>> namemaps cc-main-2020-feb-mar-may-domain
>> cn "org.commoncrawl"
>> pwn
#76320850 org.commoncrawl
>> ls
0: #1143797 au.com.dejanseo
1: #1644111 au.com.spatialsource
2: #2474905 be.youtu
...
190: #79223837 org.whatwg
191: #79235545 org.wikimedia
192: #79235849 org.wikipedia
193: #79236154 org.wikireverse
194: #80908556 re.slidesha
195: #87060356 uk.co.bbc
196: #89576687 us.lumeno
>> sl
0: #69452 ai.botxo
1: #74455 ai.kritikalvision
...
675: #89325941 uk.org.pigsonthewing
676: #89427893 uk.webxtrakt
677: #89611311 us.onehack
678: #89622268 us.pingpong
679: #89702797 us.zillman
680: #89834247 vn.avnuc
681: #90129490 wiki.sysadmin
682: #90203492 work.yokonoji
# see
# https://github.com/commoncrawl/cc-webgraph
# http://webgraph.di.unimi.it/
# http://law.di.unimi.it/tutorial.php
# download domain-level graphs
# https://commoncrawl.org/2020/06/host-and-domain-level-web-graphs-febmarmay-2020/
for f in cc-main-2020-feb-mar-may-domain-t.graph cc-main-2020-feb-mar-may-domain-t.properties \
cc-main-2020-feb-mar-may-domain.graph cc-main-2020-feb-mar-may-domain.properties \
cc-main-2020-feb-mar-may-domain.stats \
cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
cc-main-2020-feb-mar-may-domain-edges.txt.gz; do
aws --no-sign-request s3 cp s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/domain/$f .;
done
# variable to execute java with all webgraph jars on the class path
WG="cc-webgraph/src/script/webgraph_ranking/run_webgraph.sh"
# generate offset files *.offset
$WG it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2020-feb-mar-may-domain
# also for the transpose of the graph
$WG it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2020-feb-mar-may-domain-t
# generate connected components: weakly connected (*.wcc, *.wccsizes) and strongly connected (*.scc, *.sccsizes)
$WG it.unimi.dsi.webgraph.algo.ConnectedComponents -m --renumber --sizes -t cc-main-2020-feb-mar-may-domain-t cc-main-2020-feb-mar-may-domain
$WG it.unimi.dsi.webgraph.algo.StronglyConnectedComponents --renumber --sizes cc-main-2020-feb-mar-may-domain
# generate statistics and degrees
# Note: if connected components files are present, these are used for statistics
# --save-degrees makes Stats generate text files holding
# - the degree of each node: *.outdegrees and *.indegrees
# - the frequency distributions of degrees: *.outdegree and *.indegree
# see also http://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/Stats.html
$WG it.unimi.dsi.webgraph.Stats --save-degrees cc-main-2020-feb-mar-may-domain
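# A quick way to look at the degree distribution directly, sketched under the
# assumption that *.outdegrees / *.indegrees hold one degree value per line
# (one line per node). The function name degree_histogram is hypothetical, and
# nothing is assumed here about the layout of the *.outdegree / *.indegree files.
```shell
degree_histogram() {
    # input:  file with one integer degree per line
    # output: "<degree><TAB><number of nodes>", ordered by degree
    sort -n "$1" | uniq -c | awk '{ print $2 "\t" $1 }'
}
# usage: degree_histogram cc-main-2020-feb-mar-may-domain.outdegrees | head
```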
# join degrees and node names
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
| paste - cc-main-2020-feb-mar-may-domain.outdegrees cc-main-2020-feb-mar-may-domain.indegrees \
| gzip >cc-main-2020-feb-mar-may-domain-vertices-out-indegrees.txt.gz
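# paste joins the files line by line and pads with empty fields if one input is
# shorter, so a mismatch would go unnoticed. A small sanity check that could be
# run before the join (hypothetical helper check_line_counts, a sketch only):
```shell
check_line_counts() {
    # $1: gzipped vertices file, remaining arguments: plain text degree files
    local vertices=$1 expected n f
    shift
    expected=$(( $(gzip -dc "$vertices" | wc -l) ))
    for f in "$@"; do
        n=$(( $(wc -l < "$f") ))
        if [ "$n" -ne "$expected" ]; then
            echo "mismatch: $f has $n lines, expected $expected" >&2
            return 1
        fi
    done
}
# usage:
# check_line_counts cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
#     cc-main-2020-feb-mar-may-domain.outdegrees cc-main-2020-feb-mar-may-domain.indegrees
```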
# grep 4 nodes (domains)
zgrep -P '\t(com\.(google|facebook)|org\.(commoncrawl|wikipedia))\t' cc-main-2020-feb-mar-may-domain-vertices-out-indegrees.txt.gz
#id name numhosts outdegree indegree
#22962984 com.facebook 5079 24 18359247
#25338395 com.google 2902 219754 15844122
#76320850 org.commoncrawl 4 197 683
#79235849 org.wikipedia 1845 1902491 2480862
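# The joined file can also be ranked. A minimal sketch, assuming the five
# tab-separated columns listed above (id, name, numhosts, outdegree, indegree);
# the function name top_by_indegree is hypothetical.
```shell
top_by_indegree() {
    # $1: gzipped joined vertices file, $2: number of rows to print (default 10)
    local file=$1 n=${2:-10} tab
    tab=$(printf '\t')
    # sort numerically, descending, on the fifth column (in-degree)
    gzip -dc "$file" | sort -t "$tab" -k5,5nr | head -n "$n"
}
# usage: top_by_indegree cc-main-2020-feb-mar-may-domain-vertices-out-indegrees.txt.gz 20
```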
#### String maps (see section "Rebuilding string maps" of http://law.di.unimi.it/tutorial.php)
# input is the second column of the vertices file containing the reversed domain names
# - the mapping of names to node numbers *.mph, see
# [sux4j mph package](http://sux4j.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/package-summary.html) (minimal perfect hash)
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
| cut -f2 \
| $WG it.unimi.dsi.sux4j.mph.GOV4Function cc-main-2020-feb-mar-may-domain.mph -
# - the string map *.smph (allows verifying whether a node name is present)
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
| cut -f2 \
| $WG it.unimi.dsi.util.ShiftAddXorSignedStringMap cc-main-2020-feb-mar-may-domain.mph cc-main-2020-feb-mar-may-domain.smph
# - the front-coded list *.fcl
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
| cut -f2 \
| $WG it.unimi.dsi.util.FrontCodedStringList -u -r 32 cc-main-2020-feb-mar-may-domain.fcl
# Note: building the [immutable external prefix map](http://dsiutils.di.unimi.it/docs/it/unimi/dsi/util/ImmutableExternalPrefixMap.html) *.iepm,
# which would map node names to numbers and back, fails for the domain graph because the vertices are sorted
# by domain hierarchy, and that order conflicts with the lexical order the tool expects:
# zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
# | cut -f2 \
# | $WG it.unimi.dsi.util.ImmutableExternalPrefixMap -b4Ki cc-main-2020-feb-mar-may-domain.iepm
# Exception in thread "main" java.lang.IllegalArgumentException: The provided term collection is not sorted [ac.alpha-seminar, ac.alpha]
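# The ordering problem can be reproduced without webgraph. A sketch, assuming
# that byte-wise order (C locale) matches the lexical order expected for these
# ASCII names; the function name is_lexically_sorted is hypothetical.
```shell
is_lexically_sorted() {
    # reads names from stdin, succeeds only if they are in byte-wise order
    LC_ALL=C sort -c 2>/dev/null
}
# the pair from the exception above: "ac.alpha" is a prefix of "ac.alpha-seminar"
# and therefore has to come first in lexical order
if printf 'ac.alpha-seminar\nac.alpha\n' | is_lexically_sorted; then
    echo "sorted"
else
    echo "not sorted"
fi
# to check the whole list:
# zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz | cut -f2 | is_lexically_sorted
```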
@sebastian-nagel
Author

However, loading the graph failed at first.

The cause was too little Java heap space, which surfaced only as "java.lang.RuntimeException: java.lang.reflect.InvocationTargetException". That message does not help to find the reason (should the console print the complete stack trace to show the "caused by"?).

Enough heap space is needed: -Xmx12g to load the graph and the node name map.
