Skip to content

Instantly share code, notes, and snippets.

@hideojoho
Last active January 2, 2018 06:48
Show Gist options
  • Save hideojoho/95e82e81c19f34e704a37928efd993c3 to your computer and use it in GitHub Desktop.
Save hideojoho/95e82e81c19f34e704a37928efd993c3 to your computer and use it in GitHub Desktop.
How to set up a virtual machine to host a Wikipedia dump in MySQL

How to set up a virtual machine to host a Wikipedia dump in MySQL

This is an instruction specifically designed to host a Wikipedia dump on mysql. Be aware the reserved spec for the VM below.

  • Disk size: 500GB
  • Memory: 12GB
  • Client OS: Ubuntu 16.04
  • Host OS: MacOS 10.13.2

Install VirtualBox and Vagrant via Homebrew

$ brew update
$ brew install Caskroom/cask/virtualbox
$ brew install Caskroom/cask/virtualbox-extension-pack
$ brew install Caskroom/cask/vagrant
$ brew install Caskroom/cask/vagrant-manager

Install plugins

$ vagrant plugin install vagrant-vbguest
$ vagrant plugin install vagrant-disksize

Add a box

$ vagrant box add ubuntu/xenial64

Create a VM folder

$ mkdir Ubuntu-1604
$ cd Ubuntu-1604
$ vagrant init ubuntu/xenial64

Copy Vagrant files

  • Copy Vagrantfile and Vagrant_file.sh to Ubuntu-1604
$ vagrant up

In VM

Initialise MariaDB

$ sudo mysql_secure_installation

Set root password for Python access

$ sudo mysql -u root
MariaDB [(none)]> SET PASSWORD = PASSWORD('YOUR_PASSWORD');
MariaDB [(none)]> update mysql.user set plugin = 'mysql_native_password' where User='root';
MariaDB [(none)]> FLUSH PRIVILEGES;

Create a DB for Wikipedia

$ sudo mysql -u root
MariaDB [(none)]> create database jawiki character set binary;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> use jawiki;
Database changed

MariaDB [jawiki]> show variables like 'character%';
+--------------------------+----------------------------+                   │
| Variable_name            | Value                      |
+--------------------------+----------------------------+        
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | binary                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+        
8 rows in set (0.00 sec)

Create tables for Wikipedia

$ wget "https://phab.wmfusercontent.org/file/data/oa3txdvpzlnzff5hkglk/PHID-FILE-quy3u5xfqnh4y2nat4jk/tables.sql"
$ sudo mysql -u root jawiki < tables.sql
$ sudo mysql -u root jawiki
MariaDB [jawiki]> show tables;
+-----------------------+                                                   │
| Tables_in_jawiki      |
+-----------------------+
| archive               |
| bot_passwords         |
| category              |
...
| valid_tag             |
| watchlist             |
+-----------------------+                                        
52 rows in set (0.00 sec) 

Download Wikipedia dump files

Note: This might take a while.

Example dump: https://dumps.wikimedia.org/jawiki/20171001/

  • jawiki-20171001-pages-articles.xml.bz2 (2.4GB)
  • jawiki-20171001-categorylinks.sql.gz (165MB)
$ wget "https://dumps.wikimedia.org/jawiki/20171001/jawiki-20171001-pages-articles.xml.bz2"
$ wget "https://dumps.wikimedia.org/jawiki/20171001/jawiki-20171001-categorylinks.sql.gz"

Import Wikipedia dump files into DB

Note: This WILL take a long time, more than a cup of tea. Prepare to wait for at least 12-24 hours to complete.

$ java -jar ../../mediawiki-tools-mwdumper/target/mwdumper-1.25.jar --format=sql:1.25  --filter=latest --filter=notalk jawiki-20171001-pages-articles.xml.bz2 | sudo mysql -u root jawiki
$ gunzip jawiki-20171001-categorylinks.sql.gz
$ sudo mysql -u root jawiki < jawiki-20171001-categorylinks.sql
$ sudo mysql -u root jawiki
MariaDB [jawiki]> select count(*) from page;
+----------+
| count(*) |
+----------+
|  2208140 |
+----------+
1 row in set (1.49 sec) 
echo "Setting Locale to en_US.UTF-8..."
sudo echo "LANG=en_US.UTF-8" | sudo tee --append /etc/environment
sudo echo "LANGUAGE=en_US.UTF-8" | sudo tee --append /etc/environment
sudo echo "LC_ALL=en_US.UTF-8" | sudo tee --append /etc/environment
sudo echo "LC_CTYPE=en_US.UTF-8" | sudo tee --append /etc/environment
echo "Updating packages..."
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get -y install software-properties-common
echo "Install Java 8 and Maven..."
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y upgrade
echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections
echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections
sudo apt-get -y install oracle-java8-installer
sudo apt-get -y install maven
echo "Installing MariaDB ..."
sudo apt update
sudo apt-get -y install mariadb-server
sudo sed -i 's/^#\[server\]/\[server\]\nmax_allowed_packet = 16MB/g' /etc/mysql/mariadb.conf.d/50-server.cnf
sudo systemctl restart mysql.service
echo "Installing MWDumper ..."
cd
mkdir Local
cd Local
git clone https://github.com/wikimedia/mediawiki-tools-mwdumper.git
cd mediawiki-tools-mwdumper
mvn package
cd
echo "Installing Python (Miniconda) ..."
cd
mkdir Local/Miniconda
cd Local/Miniconda
wget "https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh"
bash ./Miniconda3-latest-Linux-x86_64.sh -b -u -p $HOME/Local/Miniconda
echo "export PATH=\$HOME/Local/Miniconda/bin:\$PATH" >> $HOME/.bash_profile
exec "$SHELL"
conda install -y mysql-connector-python
# -*- mode: ruby -*-
# vi: set ft=ruby :
# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
# The most common configuration options are documented and commented below.
# For a complete reference, please see the online documentation at
# https://docs.vagrantup.com.
# Every Vagrant development environment requires a box. You can search for
# boxes at https://atlas.hashicorp.com/search.
config.vm.box = "ubuntu/xenial64"
config.disksize.size = '500GB'
# Disable automatic box update checking. If you disable this, then
# boxes will only be checked for updates when the user runs
# `vagrant box outdated`. This is not recommended.
# config.vm.box_check_update = false
# Create a forwarded port mapping which allows access to a specific port
# within the machine from a port on the host machine. In the example below,
# accessing "localhost:8080" will access port 80 on the guest machine.
# config.vm.network "forwarded_port", guest: 80, host: 8080
# Create a private network, which allows host-only access to the machine
# using a specific IP.
config.vm.network "private_network", ip: "192.168.33.10"
# Create a public network, which generally matched to bridged network.
# Bridged networks make the machine appear as another physical device on
# your network.
# config.vm.network "public_network"
# Share an additional folder to the guest VM. The first argument is
# the path on the host to the actual folder. The second argument is
# the path on the guest to mount the folder. And the optional third
# argument is a set of non-required options.
config.vm.synced_folder ".", "/vagrant", type: "virtualbox"
# Provider-specific configuration so you can fine-tune various
# backing providers for Vagrant. These expose provider-specific options.
# Example for VirtualBox:
#
config.vm.provider "virtualbox" do |vb|
# # Display the VirtualBox GUI when booting the machine
# vb.gui = true
#
# # Customize the amount of memory on the VM:
vb.memory = "12288"
end
#
# View the documentation for the provider you are using for more
# information on available options.
# Define a Vagrant Push strategy for pushing to Atlas. Other push strategies
# such as FTP and Heroku are also available. See the documentation at
# https://docs.vagrantup.com/v2/push/atlas.html for more information.
# config.push.define "atlas" do |push|
# push.app = "YOUR_ATLAS_USERNAME/YOUR_APPLICATION_NAME"
# end
# Enable provisioning with a shell script. Additional provisioners such as
# Puppet, Chef, Ansible, Salt, and Docker are also available. Please see the
# documentation for more information about their specific syntax and use.
config.vm.provision "shell", privileged: false, path: "Vagrant_provision.sh"
end
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment