This is a set of instructions for hosting a Wikipedia dump on MySQL (MariaDB) in a virtual machine. Be aware of the spec reserved for the VM below.
- Disk size: 500GB
- Memory: 12GB
- Guest OS: Ubuntu 16.04
- Host OS: MacOS 10.13.2
$ brew update
$ brew install Caskroom/cask/virtualbox
$ brew install Caskroom/cask/virtualbox-extension-pack
$ brew install Caskroom/cask/vagrant
$ brew install Caskroom/cask/vagrant-manager
$ vagrant plugin install vagrant-vbguest
$ vagrant plugin install vagrant-disksize
$ vagrant box add ubuntu/xenial64
$ mkdir Ubuntu-1604
$ cd Ubuntu-1604
$ vagrant init ubuntu/xenial64
- Copy Vagrantfile and Vagrant_file.sh into the Ubuntu-1604 directory
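For reference, a minimal Vagrantfile matching the spec above might look like the following. This is a sketch, not the project's actual file; the box name, plugin, and provisioning script name are taken from the steps above, and the exact settings should be adjusted to your environment.

```ruby
# Sketch of a Vagrantfile matching the spec above; adjust as needed.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/xenial64"
  config.disksize.size = "500GB"          # needs the vagrant-disksize plugin
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 12288                     # 12GB
  end
  # Provision with the shell script copied alongside this file.
  config.vm.provision "shell", path: "Vagrant_file.sh"
end
```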
$ vagrant up
$ vagrant ssh
Note: All commands from here on run inside the guest.
$ sudo mysql_secure_installation
$ sudo mysql -u root
MariaDB [(none)]> SET PASSWORD = PASSWORD('YOUR_PASSWORD');
MariaDB [(none)]> update mysql.user set plugin = 'mysql_native_password' where User='root';
MariaDB [(none)]> FLUSH PRIVILEGES;
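Before relying on password authentication, it may be worth confirming the plugin change took effect. A hedged check, run inside the guest:

```shell
# Confirm root's auth plugin was switched to mysql_native_password.
sudo mysql -u root -e "SELECT User, Host, plugin FROM mysql.user WHERE User='root';"
```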
$ sudo mysql -u root
MariaDB [(none)]> create database jawiki character set binary;
Query OK, 1 row affected (0.00 sec)
MariaDB [(none)]> use jawiki;
Database changed
MariaDB [jawiki]> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | binary |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
$ wget "https://phab.wmfusercontent.org/file/data/oa3txdvpzlnzff5hkglk/PHID-FILE-quy3u5xfqnh4y2nat4jk/tables.sql"
$ sudo mysql -u root jawiki < tables.sql
$ sudo mysql -u root jawiki
MariaDB [jawiki]> show tables;
+-----------------------+
| Tables_in_jawiki |
+-----------------------+
| archive |
| bot_passwords |
| category |
...
| valid_tag |
| watchlist |
+-----------------------+
52 rows in set (0.00 sec)
Note: Downloading these dump files might take a while.
Example dump: https://dumps.wikimedia.org/jawiki/20171001/
- jawiki-20171001-pages-articles.xml.bz2 (2.4GB)
- jawiki-20171001-categorylinks.sql.gz (165MB)
$ wget "https://dumps.wikimedia.org/jawiki/20171001/jawiki-20171001-pages-articles.xml.bz2"
$ wget "https://dumps.wikimedia.org/jawiki/20171001/jawiki-20171001-categorylinks.sql.gz"
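It is worth verifying the downloads before spending hours importing them. Wikimedia publishes checksum files alongside each dump; the md5sums filename below is an assumption based on the usual dump layout, so check the dump index page if it differs.

```shell
# Fetch the published checksums for this dump (filename assumed from
# the usual dump layout) and verify only the two files we downloaded.
wget "https://dumps.wikimedia.org/jawiki/20171001/jawiki-20171001-md5sums.txt"
grep -E 'pages-articles\.xml\.bz2$|categorylinks\.sql\.gz$' \
    jawiki-20171001-md5sums.txt > md5sums.subset
md5sum -c md5sums.subset   # each file should report OK
```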
Note: This WILL take a long time, far more than a coffee break. Expect the import to take at least 12-24 hours to complete.
$ java -jar ../../mediawiki-tools-mwdumper/target/mwdumper-1.25.jar --format=sql:1.25 --filter=latest --filter=notalk jawiki-20171001-pages-articles.xml.bz2 | sudo mysql -u root jawiki
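The mwdumper jar referenced above is not distributed with the dump; the relative path suggests it was built from the mediawiki-tools-mwdumper sources with Maven, roughly as follows (the GitHub mirror URL is an assumption; the canonical repository lives on Wikimedia's Gerrit):

```shell
# Build mwdumper from source (requires git, a JDK, and Maven).
git clone "https://github.com/wikimedia/mediawiki-tools-mwdumper.git"
cd mediawiki-tools-mwdumper
mvn package   # produces target/mwdumper-1.25.jar (version may differ)
```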
$ gunzip jawiki-20171001-categorylinks.sql.gz
$ sudo mysql -u root jawiki < jawiki-20171001-categorylinks.sql
$ sudo mysql -u root jawiki
MariaDB [jawiki]> select count(*) from page;
+----------+
| count(*) |
+----------+
| 2208140 |
+----------+
1 row in set (1.49 sec)
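The categorylinks import can be sanity-checked the same way; an illustrative query (no expected count is given here):

```shell
# Count imported category links to confirm that import completed too.
sudo mysql -u root jawiki -e "SELECT COUNT(*) FROM categorylinks;"
```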