Source: https://kodingnotes.wordpress.com/2014/12/03/parsing-wikipedia-page-hierarchy/
Wikipedia SQL dumps are distributed in MySQL format. If you want to process them in SQLite or DuckDB instead, here is how.
Optionally use https://github.com/dumblob/mysql2sqlite/. I used this to process the table definitions, which I then simplified to get page.sql and categorylinks.sql.
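Per its README, mysql2sqlite reads a dump file and writes SQLite-compatible SQL to stdout, so getting convertible CREATE TABLE statements is just the following (the output file names here are my own):

./mysql2sqlite enwiki-latest-page.sql > page-converted.sql                  # then trim by hand to page.sql
./mysql2sqlite enwiki-latest-categorylinks.sql > categorylinks-converted.sql  # likewise for categorylinks.sql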
cat enwiki-latest-page.sql | bash mysql2tsv.bash > enwiki-latest-page.tsv # ~52M lines
cat enwiki-latest-categorylinks.sql | bash mysql2tsv.bash > enwiki-latest-categorylinks.tsv # ~154M lines
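The two commands above rely on mysql2tsv.bash, which isn't reproduced in the post. A hypothetical reconstruction is below: it splits each extended INSERT statement into one tab-separated row per value tuple, with a small quote-aware scanner so commas inside page titles survive. Caveats: MySQL escape sequences such as \n are flattened to the bare character, unquoted NULLs come through as the literal string NULL, and any value containing a real tab would corrupt the TSV.

#!/usr/bin/env bash
# mysql2tsv.bash -- hypothetical sketch; the original script is not shown in the post.
# Reads a MySQL dump on stdin and emits one tab-separated row per value tuple.
awk -v SQ="'" '
/^INSERT INTO/ {
  sub(/^INSERT INTO [^ ]+ VALUES \(/, "")   # drop the statement prefix
  sub(/\);$/, "")                           # drop the trailing ");"
  row = ""; instr = 0
  n = length($0)
  for (i = 1; i <= n; i++) {
    c = substr($0, i, 1)
    if (instr) {
      if (c == "\\")    { row = row substr($0, ++i, 1) }  # keep char after backslash
      else if (c == SQ) { instr = 0 }                     # closing quote
      else              { row = row c }
    }
    else if (c == SQ)   { instr = 1 }                     # opening quote
    else if (c == ",")  { row = row "\t" }                # field separator -> tab
    else if (c == ")" && substr($0, i + 1, 2) == ",(") {  # tuple boundary "),("
      print row; row = ""; i += 2
    }
    else                { row = row c }
  }
  print row                                               # last tuple on the line
}'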
cat page.sql | sqlite3 enwiki.sqlite3
cat categorylinks.sql | sqlite3 enwiki.sqlite3
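The simplified definitions aren't shown here either. One constraint to keep in mind: sqlite3's .import (below) requires the table's column count to match the TSV, so "simplified" means stripping MySQL-isms (backticks, KEY lines, ENGINE clauses) rather than dropping columns. Based on the stock MediaWiki schema of that era, categorylinks.sql would look roughly like this (page.sql is analogous; check your dump's own CREATE TABLE for the exact column set):

-- categorylinks.sql -- a sketch, not the post's actual file
CREATE TABLE categorylinks (
  cl_from INTEGER,        -- page_id of the member page
  cl_to TEXT,             -- category title, without the "Category:" prefix
  cl_sortkey TEXT,
  cl_sortkey_prefix TEXT,
  cl_timestamp TEXT,
  cl_collation TEXT,
  cl_type TEXT            -- 'page', 'subcat' or 'file'
);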
echo $'.sep \t\n.import enwiki-latest-page.tsv page' | sqlite3 enwiki.sqlite3
echo $'.sep \t\n.import enwiki-latest-categorylinks.tsv categorylinks' | sqlite3 enwiki.sqlite3
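With both tables loaded, the page hierarchy is a join between them. Assuming the simplified definitions kept the stock MediaWiki column names as sketched above, something like this lists the subcategories of a category (the one-off index matters: without it every cl_to lookup scans ~154M rows):

sqlite3 enwiki.sqlite3 <<'SQL'
CREATE INDEX IF NOT EXISTS categorylinks_cl_to ON categorylinks (cl_to);
-- list the subcategories of Category:Databases
SELECT p.page_title
FROM categorylinks c
JOIN page p ON p.page_id = c.cl_from
WHERE c.cl_to = 'Databases'
  AND c.cl_type = 'subcat';
SQL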
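For DuckDB, mentioned at the top, the CREATE TABLE step can be skipped entirely: its CSV reader infers a schema from the TSVs directly. A minimal sketch, assuming auto-generated column names are acceptable (pass names=[...] to read_csv to set them explicitly):

duckdb enwiki.duckdb <<'SQL'
-- header=false: the TSVs have no header row; quote='': rows are raw tab-separated text
CREATE TABLE page AS
  SELECT * FROM read_csv('enwiki-latest-page.tsv',
                         delim='\t', header=false, quote='');
CREATE TABLE categorylinks AS
  SELECT * FROM read_csv('enwiki-latest-categorylinks.tsv',
                         delim='\t', header=false, quote='');
SQL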