
VoltDB Operations Manual

This is an example of a document that a VoltDB customer or partner might write to capture the specific processes and commands used for common VoltDB operations.

Hardware & OS

For our VoltDB cluster, use servers with the following specs:
(Note: this is only an example; you should document the exact hardware used for your installation, and use identical hardware when adding nodes to the cluster.)

Amazon EC2 Example:

In production, we use c3.2xlarge nodes with 8 hyperthreads and 15GB of RAM
In test, we use c3.2xlarge nodes with 8 hyperthreads and 15GB of RAM
In dev, we use a c3.xlarge node with 4 hyperthreads and 7.5GB of RAM

Bare-metal example:

Model: HP ProLiant DL380
CPUs: 2x Intel Exxxx xx.x GHz (?)-core
RAM: (?)x 16GB xxxx speed ECC
OS: Ubuntu 14.04 LTS

See Using VoltDB, Section 2.1, "Operating System and Software Requirements".

Install Prerequisites

Java JDK

Install OpenJDK 8, the full JDK:

sudo apt-get install openjdk-8-jdk

Note: The Sun JDK or OpenJDK, version 7 or 8, is supported. The command above is for Ubuntu.
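To confirm the JDK installed correctly, check the reported version:

java -version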

Configure NTP

Install NTP

sudo apt-get install ntp  
sudo service ntp stop

sudo vi /etc/ntp.conf

Remove any lines beginning with "server" and add the following lines

server time1.google.com
server time2.google.com
server time3.google.com

Run the following command repeatedly until the offset is less than +/- 0.001 sec

sudo ntpdate -b -p 8 time1.google.com
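If you prefer to script this step rather than re-run the command by hand, a loop like the following works. This is only a sketch; it assumes ntpdate reports the offset as "... offset <seconds> sec" in its output.

# repeat until the reported offset is within +/- 0.001 sec
while :; do
  offset=$(sudo ntpdate -b -p 8 time1.google.com | awk '{ for (i = 1; i < NF; i++) if ($i == "offset") print $(i+1) }' | tail -1)
  echo "current offset: ${offset:-unknown} sec"
  if [ -n "$offset" ]; then
    awk -v o="$offset" 'BEGIN { exit (o < 0.001 && o > -0.001) ? 0 : 1 }' && break
  fi
  sleep 2
done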

Restart NTP service

sudo service ntp start

To check that NTP is configured and clocks are closely synchronized:

ntpq -p host1 host2 host3

Other Configurations

See Administrator's Guide, Ch.2.

Install VoltDB

To install VoltDB, create a user "voltdb" with default permissions, log in as that user, and extract the distribution in its home directory:

cd /home/voltdb
tar -xzvf LINUX-voltdb-ent-5.2.1.tar.gz

Create a symlink for this version of VoltDB, for use in the PATH:

ln -s voltdb-ent-5.2.1 voltdb

Add the bin directory to the PATH by adding the following line to .bashrc:

export PATH=$PATH:$HOME/voltdb/bin
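Then reload the shell profile (or log out and log back in) so the new PATH takes effect:

source $HOME/.bashrc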

Test that you can run the VoltDB command line tools:

voltdb --version

Deploy application files

Repeat the steps in this section for each server in the cluster.

Make a home folder for the application

mkdir $HOME/our_app

Make a folder for saved snapshots

mkdir $HOME/our_app/saved_snapshots

Check out the application source code into the "our_app" folder. This includes ddl.sql and the src directory containing the stored procedure source code. Then copy the deployment.xml and license.xml files into the same folder.

Starting the cluster

To start for the first time (initially with no schema or data)

On each of the servers (voltserver1, voltserver2, voltserver3), run the following command:

# on each host
voltdb create -B -d deployment.xml -l license.xml -H voltserver1
  • The -B parameter will start the database as a background daemon and redirect console output to the $HOME/.voltdb_server directory.
  • The -H parameter identifies one of the hosts that the others will connect to when initializing the cluster. After initialization, there is nothing special about this particular host.

Verifying that the database is available

You can tail the log file, which is in the "log" subdirectory of the working folder where voltdb was started. The database logs "Server completed initialization." when it becomes available.

tail -f log/volt.log
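To script the wait instead of watching the log manually, a simple loop can poll for that message (a sketch, assuming the same log location):

# block until the database reports it is available
until grep -q "Server completed initialization." log/volt.log 2>/dev/null; do
  sleep 2
done
echo "VoltDB is available"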

You can also tail the console output, which is in the "$HOME/.voltdb_server" directory, for example:

tail -f $HOME/.voltdb_server/localhost_3021.out

Another way to check that the cluster initialized is to connect using sqlcmd.

# from any host in the cluster
sqlcmd
> exit
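This check can also be scripted by testing the exit status of sqlcmd (a sketch; it assumes sqlcmd exits non-zero when it cannot connect):

if echo "exit" | sqlcmd > /dev/null 2>&1; then
  echo "cluster is accepting connections"
else
  echo "cluster is not reachable yet"
fi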

Loading the Schema

First, compile any Java stored procedures. This example assumes the procedure source files are under the "./src" subdirectory, for example "./src/org/yourcompany/procedures/*.java".

cd $HOME/our_app
export CLASSPATH=$HOME/voltdb/voltdb/voltdb-*.jar
SRC=`find src -name "*.java"`
mkdir -p classes
javac -classpath $CLASSPATH -d classes $SRC

Then package the class files into a .jar file (once packaged, the classes directory is no longer needed):

jar cf procs.jar -C classes .    
rm -rf classes

Load the schema into VoltDB

sqlcmd < ddl.sql

Note: ddl.sql should contain the command "LOAD CLASSES procs.jar;" to load the stored procedure classes before declaring "CREATE PROCEDURE ... FROM CLASS ...".
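For reference, a minimal ddl.sql might look like the following (the table and procedure names are illustrative only):

-- load the compiled stored procedure classes first
LOAD CLASSES procs.jar;

-- example table (illustrative)
CREATE TABLE example_table (
  id BIGINT NOT NULL,
  value VARCHAR(64),
  PRIMARY KEY (id)
);
PARTITION TABLE example_table ON COLUMN id;

-- declare a procedure from one of the loaded classes
CREATE PROCEDURE FROM CLASS org.yourcompany.procedures.ExampleProcedure;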

Stopping the cluster

From any server in the cluster, in any directory, run the following.

# from any host in the cluster

voltadmin pause

# use today's date as the last parameter
voltadmin save /home/voltdb/our_app/saved_snapshots 2015-01-01

voltadmin shutdown
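To avoid typing the date by hand, the three steps can be combined into a small script (a sketch; --blocking makes voltadmin wait for the snapshot to complete before returning):

# orderly shutdown with a dated snapshot
voltadmin pause
voltadmin save --blocking /home/voltdb/our_app/saved_snapshots "$(date +%F)"
voltadmin shutdown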

Restarting after a shutdown

On each of the servers (voltserver1, voltserver2, voltserver3), run the following command. Use -H voltserver1 in all cases; this tells voltserver1 that it is the leader for cluster startup, and tells the other servers to connect to voltserver1 to join the cluster.

# on each host
voltdb recover -B -d deployment.xml -H voltserver1 -l license.xml

The -B parameter will start the database as a background daemon. Note: recover should be used to restore the database exactly as it was before shutdown; do not modify deployment.xml when using this command. To make configuration changes, use the maintenance window process (see below).

Turn off user access to the cluster

From any server in the cluster, in any directory, run:

# from any host in the cluster
voltadmin pause

Turn back on user access to the cluster

From any server in the cluster, in any directory, run:

# from any host in the cluster
voltadmin resume

Updating the license file

When a license expires, the cluster continues to run, but subsequent actions such as stopping and restarting may require the new license to be in place. Prior to expiration, copy the new license over the old file so that it is picked up by any subsequent action that requires a license check. This can be done while the cluster is running (it has no immediate effect).

cd $HOME/our_app
cp <new_license_file.xml> license.xml

Taking a manual backup

A snapshot is a point-in-time consistent copy of the entire contents of the database. It is written to local disk at each node in a cluster to distribute the work of persistence. A snapshot can be taken at any time, whether the database is online and available to users or in admin mode.

Note: The directory you take a snapshot into must exist on each server in the cluster, and the voltdb user must have write permission to it.
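If the directory does not exist yet on every node, a loop like the following can create it (a sketch, assuming passwordless ssh as the voltdb user to each host):

for host in voltserver1 voltserver2 voltserver3; do
  ssh voltdb@"$host" mkdir -p /home/voltdb/our_app/saved_snapshots
done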

# Use today's date and the current time as the "nonce" or prefix for the snapshot
voltadmin save /home/voltdb/our_app/saved_snapshots 2015-01-01_0000

Note: if you need to take a backup prior to stopping the cluster for maintenance, see the section below for additional steps.

Restoring from a manual backup

First, follow the steps documented above to start the cluster using the "voltdb create" command and load the schema. Then run the following command to load the snapshot data.

# run from any directory
voltadmin restore /home/voltdb/our_app/saved_snapshots 2015-01-01_0000

Note: The files with this prefix need to be located in the specified path on one or more of the servers in the cluster. For example, if you are loading a backup of production to test, you need to first copy these files from this path on the production servers to the same path on the test servers.
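For example, to copy the snapshot files from a production server to the corresponding test server, something like the following could be run on each production host (the test hostname is illustrative, and this assumes ssh access between the environments):

# copy all files with the snapshot prefix to the test server
rsync -av /home/voltdb/our_app/saved_snapshots/2015-01-01_0000* \
  voltdb@test-voltserver1:/home/voltdb/our_app/saved_snapshots/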

Stopping and restarting the cluster for a maintenance window

To make major configuration changes to the database, such as changing the kfactor or sitesperhost settings, or modifying the configuration of command logging or export, the database must be stopped and restarted. "voltdb recover" restores the cluster from the command log in exactly the same configuration, so it cannot be used to make configuration changes. Instead, restart the database empty, as if for the first time, using "voltdb create", and then restore a backup. The additional "pause" and "resume" commands ensure that users cannot add data between the time the snapshot (backup) is taken and the time it is restored.

  1. Pause the database (stop accepting requests from users)

    voltadmin pause

  2. Take a manual snapshot

    voltadmin save --blocking /home/voltdb/our_app/saved_snapshots snapshot_prefix

Note: the "--blocking" parameter means voltadmin will wait to make sure the snapshot was successful before returning.

  3. Shut down the database

    voltadmin shutdown

  4. Make changes (update the catalog or deployment files)

It is recommended to restart the cluster in "admin mode", which means "paused", so that even if users connect right away they cannot make any writes until the database is resumed after the snapshot is restored. To enable this, add the admin-mode setting to deployment.xml:

<deployment>
  ...
  <admin-mode port="21211" adminstartup="true"/>
</deployment>

Remember to copy the catalog and deployment file to all servers in the cluster after making changes.

  5. Restart the database

    # on each host
    cd $HOME/our_app
    voltdb create -B -d deployment.xml -l license.xml -H voltserver1

  6. Reload the schema

    sqlcmd < ddl.sql

  7. Reload the data from the snapshot

    voltadmin restore /home/voltdb/our_app/saved_snapshots snapshot_prefix

  8. Resume the database (allow users to connect and send requests)

    voltadmin resume

Rejoining a node to the cluster

If a node failed, it can be restarted and joined back to the cluster using the "voltdb rejoin" command.

# run from the app working folder where the deployment file is located
cd /home/voltdb/our_app
voltdb rejoin --host=voltserver1 --deployment=deployment.xml

If for some reason the node cannot easily be restarted (e.g. due to hardware failure), another node can be started and joined to the cluster in its place. In that case, the new node must be prepared like all the other servers, with VoltDB and all the prerequisite software installed and NTP configured.

Stopping a node manually

Some maintenance actions (such as adding more memory, or applying Linux updates) can be performed on individual servers one at a time, without any downtime for the entire cluster. In those cases, use the following commands to stop an individual server, then see the previous section for how to rejoin it. First, find the host_id of each server:

echo "exec @SystemInformation OVERVIEW;" | sqlcmd | grep HOSTNAME

This shows the host_id value for each hostname. You can then stop a particular host using the following command (for example, to stop host_id 1):

sqlcmd
> exec @StopNode 1;
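The same call can also be issued non-interactively:

echo "exec @StopNode 1;" | sqlcmd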

Note: it is also fine to simply kill -9 the VoltDB process on the particular host you want to stop. You can use the "jps" command to see which process ID VoltDB is using.

Adding Capacity

To add more capacity (RAM as well as throughput), additional servers can be added to the cluster. As with the others, these servers must be set up with VoltDB and all prerequisite software installed and NTP configured. The following command is used to have each new node join the cluster.

# run on each server to be added

# the host parameter can be the hostname of any existing server in the cluster
cd /home/voltdb/our_app
voltdb add --host=voltserver1 --license=license.xml

Note: K-factor + 1 is the number of servers that must be added at a time. If the cluster is configured with k-factor=1, then 2 servers must be added.
