@sebastian-nagel
Last active September 28, 2020 13:38
webgraph commands
### Jython
# install Jython (see https://www.jython.org/download)
wget https://repo1.maven.org/maven2/org/python/jython-standalone/2.7.2/jython-standalone-2.7.2.jar
# clone pywebgraph (fork with modifications)
git clone https://github.com/commoncrawl/py-web-graph.git
cd py-web-graph
# copy console.py into current working directory so that "pywebgraph" is visible as package
cp pywebgraph/console.py .
# $WG_CP must hold the class path for the webgraph package,
# cf. the script run_webgraph.sh
# Note: use `;' as separator on Windows
DIR=$PWD # or modify if webgraph is installed in a different location
WG_CP=$DIR/webgraph-$WEBGRAPH_VERSION.jar:$(ls $DIR/deps/*.jar | tr '\n' ':')
# Note:
# - need enough Java heap space (-Xmx...) to load large graphs
# - adapt path pointing to Python executable
java -Xmx12g -Dpython.console=org.python.util.JLineConsole -Dpython.executable=/bin/python2.7 -cp $WG_CP:../jython-standalone-2.7.2.jar: org.python.util.jython console.py
pyWebGraph console, Copyright (C) 2009 Massimo Santini
>>
>> graph .../cc-main-2020-feb-mar-may-domain
>> pwn
#0
>> namemaps cc-main-2020-feb-mar-may-domain
>> cn "org.commoncrawl"
>> pwn
#76320850 org.commoncrawl
>> ls
0: #1143797 au.com.dejanseo
1: #1644111 au.com.spatialsource
2: #2474905 be.youtu
...
190: #79223837 org.whatwg
191: #79235545 org.wikimedia
192: #79235849 org.wikipedia
193: #79236154 org.wikireverse
194: #80908556 re.slidesha
195: #87060356 uk.co.bbc
196: #89576687 us.lumeno
>> sl
0: #69452 ai.botxo
1: #74455 ai.kritikalvision
...
675: #89325941 uk.org.pigsonthewing
676: #89427893 uk.webxtrakt
677: #89611311 us.onehack
678: #89622268 us.pingpong
679: #89702797 us.zillman
680: #89834247 vn.avnuc
681: #90129490 wiki.sysadmin
682: #90203492 work.yokonoji
# see
# https://github.com/commoncrawl/cc-webgraph
# http://webgraph.di.unimi.it/
# http://law.di.unimi.it/tutorial.php
# download domain-level graphs
# https://commoncrawl.org/2020/06/host-and-domain-level-web-graphs-febmarmay-2020/
# (the vertices file is needed for the joins and string maps below)
for f in cc-main-2020-feb-mar-may-domain-t.graph cc-main-2020-feb-mar-may-domain-t.properties \
    cc-main-2020-feb-mar-may-domain.graph cc-main-2020-feb-mar-may-domain.properties \
    cc-main-2020-feb-mar-may-domain.stats \
    cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
    cc-main-2020-feb-mar-may-domain-edges.txt.gz; do
    aws --no-sign-request s3 cp "s3://commoncrawl/projects/hyperlinkgraph/cc-main-2020-feb-mar-may/domain/$f" .
done
# variable to execute java with all webgraph jars on the class path
WG="cc-webgraph/src/script/webgraph_ranking/run_webgraph.sh"
# generate offsets files *.offsets
$WG it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2020-feb-mar-may-domain
# also for the transpose of the graph
$WG it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2020-feb-mar-may-domain-t
# generate connected components: weakly *.wcc, *.wccsizes, and strongly *.scc, *.sccsizes
$WG it.unimi.dsi.webgraph.algo.ConnectedComponents -m --renumber --sizes -t cc-main-2020-feb-mar-may-domain-t cc-main-2020-feb-mar-may-domain
$WG it.unimi.dsi.webgraph.algo.StronglyConnectedComponents --renumber --sizes cc-main-2020-feb-mar-may-domain
# generate statistics and degrees
# Note: if connected components files are present, these are used for statistics
# --save-degrees makes Stats generate text files holding
# - the degree of each node: *.outdegrees resp. *.indegrees
# - the frequency distributions of degrees: *.outdegree resp. *.indegree
# see also http://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/Stats.html
$WG it.unimi.dsi.webgraph.Stats --save-degrees cc-main-2020-feb-mar-may-domain
# join degrees and node names
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
| paste - cc-main-2020-feb-mar-may-domain.outdegrees cc-main-2020-feb-mar-may-domain.indegrees \
| gzip >cc-main-2020-feb-mar-may-domain-vertices-out-indegrees.txt.gz
# grep 4 nodes (domains)
zgrep -P '\t(com\.(google|facebook)|org\.(commoncrawl|wikipedia))\t' cc-main-2020-feb-mar-may-domain-vertices-out-indegrees.txt.gz
#id name numhosts outdegree indegree
#22962984 com.facebook 5079 24 18359247
#25338395 com.google 2902 219754 15844122
#76320850 org.commoncrawl 4 197 683
#79235849 org.wikipedia 1845 1902491 2480862
#### String maps (see section "Rebuilding string maps" of http://law.di.unimi.it/tutorial.php)
# input is the second column of the vertices file containing the reversed domain names
# - the mapping of names to node numbers *.mph, see
# [sux4j mph package](http://sux4j.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/package-summary.html) (minimal perfect hash)
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
| cut -f2 \
| $WG it.unimi.dsi.sux4j.mph.GOV4Function cc-main-2020-feb-mar-may-domain.mph -
# - the string map *.smph (allows verifying whether a node name is present)
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
| cut -f2 \
| $WG it.unimi.dsi.util.ShiftAddXorSignedStringMap cc-main-2020-feb-mar-may-domain.mph cc-main-2020-feb-mar-may-domain.smph
# - the front-coded list *.fcl
zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
| cut -f2 \
| $WG it.unimi.dsi.util.FrontCodedStringList -u -r 32 cc-main-2020-feb-mar-may-domain.fcl
# Note: building the [immutable external prefix map](http://dsiutils.di.unimi.it/docs/it/unimi/dsi/util/ImmutableExternalPrefixMap.html) *.iepm,
# which would allow mapping node names to numbers and back, fails for the domain graph because sorting by domain hierarchy
# conflicts with lexical sorting:
# zcat cc-main-2020-feb-mar-may-domain-vertices.txt.gz \
# | cut -f2 \
# | $WG it.unimi.dsi.util.ImmutableExternalPrefixMap -b4Ki cc-main-2020-feb-mar-may-domain.iepm
# Exception in thread "main" java.lang.IllegalArgumentException: The provided term collection is not sorted [ac.alpha-seminar, ac.alpha]
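The sort-order conflict reported by the exception can be reproduced on the two names it quotes. This is only a minimal sketch: it assumes that byte-wise order (`LC_ALL=C`) matches the order the prefix map expects for plain ASCII names, and `is_byte_sorted` is a hypothetical helper, not part of webgraph or dsiutils.

```shell
# Hypothetical helper: succeeds only if stdin is sorted byte-wise (C locale).
is_byte_sorted() {
    LC_ALL=C sort -c 2>/dev/null
}

# The two names in the order the exception lists them: the check fails,
# because "ac.alpha" is a prefix of "ac.alpha-seminar" and therefore
# sorts before it in byte-wise order.
if printf 'ac.alpha-seminar\nac.alpha\n' | is_byte_sorted; then
    echo "sorted"
else
    echo "not sorted"
fi
```

The same helper could be fed the second column of the vertices file (`zcat ... | cut -f2 | is_byte_sorted`) to check the whole name list before attempting to build the prefix map.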
@Xue-Alex

Line 20 requires:

$WG it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2020-feb-mar-may-domain-t

to create the transpose offsets file, because cc-main-2020-feb-mar-may-domain-t.offsets is required to generate the connected components.
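
A quick way to see whether both offsets files are in place before running the connected components step is sketched below. `missing_offsets` is a hypothetical helper name, and the sketch assumes the offsets files sit next to the graph files and use the `.offsets` suffix.

```shell
# Hypothetical helper: print the offsets files that are still missing
# for a graph and its transpose (basename and basename-t).
missing_offsets() {
    for g in "$1" "$1-t"; do
        [ -f "$g.offsets" ] || echo "$g.offsets"
    done
}

# Prints nothing if both offsets files exist in the current directory.
missing_offsets cc-main-2020-feb-mar-may-domain
```

Any file name it prints would need the corresponding `BVGraph -O -L` run first.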

@sebastian-nagel
Author

Thanks! I've added the commands for the string maps. Finally, figured out how to work with Jython and Pywebgraph. However, loading the graph failed. Need to figure out why.

@sebastian-nagel
Author

sebastian-nagel commented Sep 22, 2020

Got Pywebgraph + Jython running. Note: for now, the fork of Pywebgraph (https://github.com/commoncrawl/py-web-graph) must be used.

@sebastian-nagel
Author

Updated the commands for mapping node names <-> node IDs

  • changed file name suffixes to follow usage in pywebgraph
  • pywebgraph now allows accessing nodes by name

@sebastian-nagel
Author

> However, loading the graph failed. Need to figure out why.

This was actually caused by too little Java heap space, which resulted in "java.lang.RuntimeException: java.lang.reflect.InvocationTargetException". That message is not very helpful for figuring out the reason (should the complete stack trace be printed to see the "caused by"?).

Need enough heap space: -Xmx12g to load graph and node name map.
