@mathew-hall
Created December 14, 2015 15:45
Imports the Wikipedia pagelinks dump into PostgreSQL, fixing up the MySQL-specific syntax that psql rejects. The data is loaded into the `wiki` database as the `postgres` user.
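#!/usr/bin/env bash
# Stream the gzipped dump through the Perl filter (sql.pl) into psql; pv shows progress.
# Everything psql prints, including errors, is written to psql_err.log.
pv enwiki-20151102-pagelinks.sql.gz | zcat | ./sql.pl | sudo -u postgres psql wiki >psql_err.log 2>&1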
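The pipeline above sends both stdout and stderr of psql to psql_err.log, and psql by default carries on after a failed statement, so a problem only shows up as a line in that log. A minimal sketch of a post-import check, assuming failures appear on lines containing `ERROR:`; the function name `count_import_errors` is hypothetical and not part of the gist:

```shell
# Hypothetical helper: count the lines in a psql log that contain "ERROR:".
count_import_errors() {
  # grep -c prints the number of matching lines. It exits with status 1 when
  # that number is 0, so "|| true" keeps the helper usable under "set -e".
  grep -c 'ERROR:' "$1" || true
}
```

Running `count_import_errors psql_err.log` after the import should print 0 if every statement was accepted.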
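#!/usr/bin/env perl
use strict;
use warnings;

# Make PostgreSQL accept MySQL-style backslash escapes inside string literals.
print("SET standard_conforming_strings = 'off';\n");
print("SET backslash_quote = 'on';\n");

my $nr = 0;    # number of input lines processed so far
while (<>) {
    # Backtick-quoted identifiers become double-quoted, but only in the first
    # 39 lines; later lines are left alone so backticks in the data survive.
    s/`/"/g if ($nr < 39);
    # Map MySQL column types and table options to something PostgreSQL accepts.
    s/int\(\d+\)( unsigned)?/INTEGER/g;
    s/UNIQUE KEY "\w+"/UNIQUE/g;
    s/ENGINE=InnoDB DEFAULT CHARSET=binary//;
    s/varbinary\(255\)/TEXT/g;
    # Remaining plain KEY definitions are turned into UNIQUE constraints too.
    s/KEY "\w+"/UNIQUE/g;
    # Past line 39 the table name is still backtick-quoted; strip the quotes.
    s/`PAGELINKS`/pagelinks/ig;
    # Unescape \" and rewrite \' as the SQL-standard doubled quote ''.
    s/\\"/"/g;
    s/\\'/''/g;
    print;
    $nr = $nr + 1;
}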
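To see what the schema rewrites do, the sketch below applies three of the same substitutions to a made-up MySQL column definition. It uses sed instead of the Perl filter so that it runs on its own; the function name `rewrite_sample` and the sample input are hypothetical and not taken from the dump.

```shell
# Hypothetical stand-in for a few of sql.pl's substitutions, written with sed.
rewrite_sample() {
  # 1) backtick identifier quotes -> double quotes
  # 2) int(N), optionally followed by " unsigned" -> INTEGER
  # 3) varbinary(255) -> TEXT
  sed -E \
    -e 's/`/"/g' \
    -e 's/int\([0-9]+\)( unsigned)?/INTEGER/g' \
    -e 's/varbinary\(255\)/TEXT/g'
}

printf '%s\n' '`pl_from` int(8) unsigned NOT NULL, `pl_title` varbinary(255) NOT NULL' | rewrite_sample
```

The last line prints `"pl_from" INTEGER NOT NULL, "pl_title" TEXT NOT NULL`, which PostgreSQL can parse as part of a CREATE TABLE statement.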