Skip to content

Instantly share code, notes, and snippets.

@bennokr
Last active January 12, 2023 13:43
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save bennokr/43054f250260a9538882383363fded59 to your computer and use it in GitHub Desktop.
Save bennokr/43054f250260a9538882383363fded59 to your computer and use it in GitHub Desktop.
Parsing Wikipedia Page Hierarchy in sqlite or duckdb

Parsing Wikipedia Page Hierarchy in sqlite or duckdb

Source: https://kodingnotes.wordpress.com/2014/12/03/parsing-wikipedia-page-hierarchy/

Wikipedia sql dumps are in mysql format. If you want to process them in sqlite or duckdb, this is how.

Optionally use: https://github.com/dumblob/mysql2sqlite/

I used this to process the table definitions, which I then simplified to get page.sql and categorylinks.sql.

  cat enwiki-latest-page.sql | bash mysql2tsv.bash > enwiki-latest-page.tsv # ~52M lines
  cat enwiki-latest-categorylinks.sql | bash mysql2tsv.bash > enwiki-latest-categorylinks.tsv # ~154M lines
  
  cat page.sql | sqlite3 enwiki.sqlite3
  cat categorylinks.sql | sqlite3 enwiki.sqlite3
  
  echo $'.sep \t\n.import enwiki-latest-page.tsv page' | sqlite3 enwiki.sqlite3
  echo $'.sep \t\n.import enwiki-latest-categorylinks.tsv categorylinks' | sqlite3 enwiki.sqlite3
grep -a -i ^insert | sed 's/^[^(]*(//;s/),(/\n/g;s/);$//' | sed "s/,\([0-9']\)/\t\1/g" | sed "s/^'//;s/\t'/\t/g;s/'\t/\t/g;s/'$//;s/\\\//g"
BEGIN TRANSACTION;
CREATE TABLE page (
page_id integer NOT NULL PRIMARY KEY
, page_namespace integer NOT NULL DEFAULT '0'
, page_title text NOT NULL DEFAULT ''
, page_restrictions text
, page_is_redirect integer NOT NULL DEFAULT '0'
, page_is_new integer NOT NULL DEFAULT '0'
, page_random double NOT NULL DEFAULT '0'
, page_touched text NOT NULL DEFAULT ''
, page_links_updated text DEFAULT NULL
, page_latest integer NOT NULL DEFAULT '0'
, page_len integer NOT NULL DEFAULT '0'
, page_content_model text DEFAULT NULL
, page_lang text DEFAULT NULL
, UNIQUE (page_namespace,page_title)
);
CREATE INDEX "idx_page_page_random" ON "page" (page_random);
CREATE INDEX "idx_page_page_len" ON "page" (page_len);
CREATE INDEX "idx_page_page_redirect_namespace_len" ON "page" (page_is_redirect,page_namespace,page_len);
END TRANSACTION;
@liyucheng09
Copy link

Very helpful post!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment