Last active September 28, 2020 13:38
webgraph commands
### Jython
# install Jython (see https://www.jython.org/download)
wget https://repo1.maven.org/maven2/org/python/jython-standalone/2.7.2/jython-standalone-2.7.2.jar
# clone pywebgraph (fork with modifications)
git clone https://github.com/commoncrawl/py-web-graph.git
cd py-web-graph
# copy console.py into current working directory so that "pywebgraph" is visible as package
cp pywebgraph/console.py .
# $WG_CP must hold the class path for the webgraph package,
# cf. the script run_webgraph.sh
# Note: use `;' as separator on Windows
DIR=$PWD # or modify if webgraph is installed in a different location
WG_CP=$DIR/webgraph-$WEBGRAPH_VERSION.jar:$(ls $DIR/deps/*.jar | tr '\n' ':')
# Note:
# - need enough Java heap space (-Xmx...) to load large graphs
# - adapt path pointing to Python executable
java -Xmx12g -Dpython.console=org.python.util.JLineConsole -Dpython.executable=/bin/python2.7 -cp $WG_CP:../jython-standalone-2.7.2.jar: org.python.util.jython console.py
pyWebGraph console, Copyright (C) 2009 Massimo Santini
>>
>> graph .../cc-main-2020-feb-mar-may-domain
>> pwn
#0
>> namemaps cc-main-2020-feb-mar-may-domain
>> cn "org.commoncrawl"
>> pwn
#76320850 org.commoncrawl
>> ls
0: #1143797 au.com.dejanseo
1: #1644111 au.com.spatialsource
2: #2474905 be.youtu
...
190: #79223837 org.whatwg
191: #79235545 org.wikimedia
192: #79235849 org.wikipedia
193: #79236154 org.wikireverse
194: #80908556 re.slidesha
195: #87060356 uk.co.bbc
196: #89576687 us.lumeno
>> sl
0: #69452 ai.botxo
1: #74455 ai.kritikalvision
...
675: #89325941 uk.org.pigsonthewing
676: #89427893 uk.webxtrakt
677: #89611311 us.onehack
678: #89622268 us.pingpong
679: #89702797 us.zillman
680: #89834247 vn.avnuc
681: #90129490 wiki.sysadmin
682: #90203492 work.yokonoji
# see
# https://github.com/commoncrawl/cc-webgraph
# http://webgraph.di.unimi.it/
# http://law.di.unimi.it/tutorial.php
# download domain-level graphs
# https://commoncrawl.org/2020/06/host-and-domain-level-web-graphs-febmarmay-2020/
# (the vertices file is needed below to join node names and to build the string maps)
for f in cc-main-2020-feb-mar-may-domain-t.graph cc-main-2020-feb-mar-may-domain-t.properties \
         cc-main-2020-feb-mar-may-domain.graph cc-main-2020-feb-mar-may-domain.properties \
         cc-main-2020-feb-mar-may-domain.stats \
         cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
         cc-main-2020-feb-mar-may-domain-edges.txt.gz; do
  aws --no-sign-request s3 cp s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/domain/$f .;
done
# variable to execute java with all webgraph jars on the class path
WG="cc-webgraph/src/script/webgraph_ranking/run_webgraph.sh"
# generate offset files *.offsets
$WG it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2020-feb-mar-may-domain
# also for the transpose of the graph
$WG it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2020-feb-mar-may-domain-t
# generate connected components: weakly *.wcc, *.wccsizes, and strongly *.scc, *.sccsizes
$WG it.unimi.dsi.webgraph.algo.ConnectedComponents -m --renumber --sizes -t cc-main-2020-feb-mar-may-domain-t cc-main-2020-feb-mar-may-domain
$WG it.unimi.dsi.webgraph.algo.StronglyConnectedComponents --renumber --sizes cc-main-2020-feb-mar-may-domain
# generate statistics and degrees
# Note: if connected components files are present, these are used for statistics
# --save-degrees makes Stats generate text files holding
# - the degree of each node: *.outdegrees and *.indegrees
# - the frequency distributions of degrees: *.outdegree and *.indegree
# see also http://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/Stats.html
$WG it.unimi.dsi.webgraph.Stats --save-degrees cc-main-2020-feb-mar-may-domain
# join degrees and node names
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
  | paste - cc-main-2020-feb-mar-may-domain.outdegrees cc-main-2020-feb-mar-may-domain.indegrees \
  | gzip >cc-main-2020-feb-mar-may-domain-vertices-out-indegrees.txt.gz
# grep 4 nodes (domains)
zgrep -P '\t(com\.(google|facebook)|org\.(commoncrawl|wikipedia))\t' cc-main-2020-feb-mar-may-domain-vertices-out-indegrees.txt.gz
#id name numhosts outdegree indegree
#22962984 com.facebook 5079 24 18359247
#25338395 com.google 2902 219754 15844122
#76320850 org.commoncrawl 4 197 683
#79235849 org.wikipedia 1845 1902491 2480862
#### String maps (see section "Rebuilding string maps" of http://law.di.unimi.it/tutorial.php)
# input is the second column of the vertices file containing the reversed domain names
# - the mapping of names to node numbers *.mph, see
# [sux4j mph package](http://sux4j.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/package-summary.html) (minimal perfect hash)
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
  | cut -f2 \
  | $WG it.unimi.dsi.sux4j.mph.GOV4Function cc-main-2020-feb-mar-may-domain.mph -
# - the string map *.smph (allows verifying whether a node name is present)
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
  | cut -f2 \
  | $WG it.unimi.dsi.util.ShiftAddXorSignedStringMap cc-main-2020-feb-mar-may-domain.mph cc-main-2020-feb-mar-may-domain.smph
# - the front-coded list *.fcl
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
  | cut -f2 \
  | $WG it.unimi.dsi.util.FrontCodedStringList -u -r 32 cc-main-2020-feb-mar-may-domain.fcl
# Note: building the [immutable external prefix map](http://dsiutils.di.unimi.it/docs/it/unimi/dsi/util/ImmutableExternalPrefixMap.html) *.iepm,
# which would allow mapping node names to numbers and back, fails for the domain graph because sorting by domain hierarchy
# conflicts with lexical sorting:
# zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
# | cut -f2 \
# | $WG it.unimi.dsi.util.ImmutableExternalPrefixMap -b4Ki cc-main-2020-feb-mar-may-domain.iepm
# Exception in thread "main" java.lang.IllegalArgumentException: The provided term collection is not sorted [ac.alpha-seminar, ac.alpha]
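The sorting conflict noted above can be checked before attempting to build the prefix map. This is a minimal sketch, assuming that the "sorted" order the prefix map expects matches plain byte order for these ASCII names; `is_lexically_sorted` is a hypothetical helper name, not part of webgraph or dsiutils.

```shell
# Hypothetical helper (not part of webgraph/dsiutils):
# succeeds only if the lines on stdin are already sorted in byte order.
is_lexically_sorted() {
    LC_ALL=C sort -c 2>/dev/null
}

# The two names from the exception message, in the order reported there.
# "ac.alpha" is a prefix of "ac.alpha-seminar" and must come first in byte order.
if printf 'ac.alpha-seminar\nac.alpha\n' | is_lexically_sorted; then
    echo "sorted"
else
    echo "not sorted"
fi
# prints: not sorted

# Against the real vertices file (assumes it has been downloaded):
#   zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz | cut -f2 | is_lexically_sorted
```

If the check fails on the names column, the input would presumably have to be re-sorted first, which would then no longer match the node numbering of the graph.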
Thanks! I've added the commands for the string maps. Finally figured out how to work with Jython and pywebgraph. However, loading the graph failed. Need to figure out why.
Got pywebgraph + Jython running. Note: for now, the fork of pywebgraph (https://github.com/commoncrawl/py-web-graph) must be used.
Updated the commands for mapping node names <> node IDs:
- changed file name suffixes to follow usage in pywebgraph
- pywebgraph now allows accessing nodes by name
> However, loading the graph failed. Need to figure out why.
This was actually caused by too little Java heap space, which surfaced only as "java.lang.RuntimeException: java.lang.reflect.InvocationTargetException". That message is not very helpful for figuring out the reason (should the complete stack trace be printed to see the "caused by"?).
Enough heap space is needed to load the graph and the node name map: -Xmx12g
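When a complete Java stack trace is available, the innermost cause can be pulled out with standard tools. This is a minimal sketch, assuming the trace has been captured in a file; `root_cause` and the file name `console.err` are hypothetical, and the pywebgraph console may not print the full trace at all, as noted above.

```shell
# Hypothetical helper: print the last (innermost) "Caused by:" line
# of a Java stack trace read from stdin. Prints nothing if there is none.
root_cause() {
    grep '^Caused by:' | tail -n 1
}

# Usage (assumes stderr of the java command was saved to console.err):
#   root_cause <console.err
```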
Line 20 requires:
$WG it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2020-feb-mar-may-domain-t
to create the transpose offsets file, as
cc-main-2020-feb-mar-may-domain-t.offsets
is required to generate the connected components
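A small guard can catch the missing offsets file before the connected-components step is started. This is a minimal sketch; `check_offsets` is a hypothetical helper, and it assumes the transpose uses the `-t` suffix as in the commands above.

```shell
# Hypothetical helper: verify that the *.offsets files exist for a graph
# and for its transpose (base name plus "-t"). Returns non-zero and
# reports each missing file on stderr.
check_offsets() {
    base=$1
    missing=0
    for g in "$base" "$base-t"; do
        if [ ! -f "$g.offsets" ]; then
            echo "missing: $g.offsets (run BVGraph -O -L $g)" >&2
            missing=1
        fi
    done
    return $missing
}

# Usage:
#   check_offsets cc-main-2020-feb-mar-may-domain \
#     && $WG it.unimi.dsi.webgraph.algo.ConnectedComponents -m --renumber --sizes \
#          -t cc-main-2020-feb-mar-may-domain-t cc-main-2020-feb-mar-may-domain
```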