Processing scripts for Wikipedia clickstream
New steps
wget https://dumps.wikimedia.org/other/clickstream/2018-03/clickstream-enwiki-2018-03.tsv.gz
gzip -cd clickstream-enwiki-2018-03.tsv.gz \
| sort -t $'\t' -k 2,2 \
| ./squash.py \
> counted-clickstream-enwiki-2018-03.tsv
squash.py requires much less memory because it takes advantage of the fact that its input has already been sorted on the curr column (the second column, hence the sort -t $'\t' -k 2,2 in the pipeline above). This means it only has to keep the clickstream data for a single curr page in memory at a time, and can print the "squashed" data as soon as it reaches the next block of curr pages.
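
As a rough illustration of that streaming idea, here is a minimal sketch of what such a squashing pass could look like. It assumes the standard clickstream columns prev, curr, type, n (tab-separated) and that the output is the summed n per curr page; the actual squash.py may aggregate differently.

#!/usr/bin/env python3
"""Sketch of a constant-memory squash over input sorted by curr.

Assumes tab-separated columns prev, curr, type, n on stdin, already
sorted on the curr column as in the pipeline above. The real squash.py
may compute something different; this only illustrates the streaming idea.
"""
import sys

current_page = None  # curr page of the block being accumulated
total = 0            # summed n for that block

for line in sys.stdin:
    prev, curr, _type, n = line.rstrip("\n").split("\t")
    if curr != current_page:
        # A new block of curr pages has started, so the previous
        # block is complete and can be printed right away.
        if current_page is not None:
            print(f"{current_page}\t{total}")
        current_page = curr
        total = 0
    total += int(n)

if current_page is not None:
    print(f"{current_page}\t{total}")  # flush the final block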
Steps (old)
First, optionally run filter_wv.py to filter the raw TSV files down to only the lines of interest.
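
For concreteness, a hypothetical sketch of such a filtering pass is below. The keep() predicate (here, keeping only article-to-article "link" rows) and the assumed column layout are placeholders; the real filter_wv.py defines its own notion of "lines of interest".

#!/usr/bin/env python3
"""Hypothetical sketch of the filtering step.

The keep() predicate and the assumed prev, curr, type, n column layout
are placeholders; filter_wv.py applies its own selection criteria.
"""
import sys

def keep(fields):
    # Placeholder criterion: keep only article-to-article link traffic.
    return len(fields) >= 4 and fields[2] == "link"

for line in sys.stdin:
    if keep(line.rstrip("\n").split("\t")):
        sys.stdout.write(line)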
Second, run show_summary.py to print a summary of the TSV files. The summaries will look like the following:
DOING all/2016_02_en_clickstream.tsv
TOTAL SIZE: 6695402206
PREV!=Wikipedia: 1770548247
PREV== other-internal 623664
PREV== other-search 0
PREV== other-external 0
PREV== other-empty 1648385145
PREV== other-other 121539438
DOING all/2016_03_en_clickstream.tsv
TOTAL SIZE: 6814906943
PREV!=Wikipedia: 2263238989
PREV== other-internal 545915
PREV== other-search 0
PREV== other-external 0
PREV== other-empty 2156754015
PREV== other-other 105939059
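
A sketch of how a summary in the shape shown above could be computed is below. It assumes the reported numbers are sums of the n (occurrence count) column, overall and broken down by the non-article prev categories, and that the columns are prev, curr, type, n with no header row; the real show_summary.py and the older TSV layouts may differ.

#!/usr/bin/env python3
"""Hedged sketch of a summary pass over clickstream TSVs.

Assumptions (the real show_summary.py may differ): tab-separated columns
prev, curr, type, n with no header row, and reported numbers are sums of
the n column, overall and per non-article prev category.
"""
import sys

OTHER_PREVS = ["other-internal", "other-search", "other-external",
               "other-empty", "other-other"]

for path in sys.argv[1:]:
    print(f"DOING {path}")
    total = 0
    by_prev = {p: 0 for p in OTHER_PREVS}
    with open(path, encoding="utf-8") as f:
        for line in f:
            prev, curr, _type, n = line.rstrip("\n").split("\t")
            n = int(n)
            total += n
            if prev in by_prev:
                by_prev[prev] += n
    print(f"TOTAL SIZE: {total}")
    print(f"PREV!=Wikipedia: {sum(by_prev.values())}")
    for p in OTHER_PREVS:
        print(f"PREV== {p} {by_prev[p]}")

Such a script would be invoked with the TSV paths as arguments, e.g. ./show_summary.py all/2016_02_en_clickstream.tsv all/2016_03_en_clickstream.tsv (hypothetical usage).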