Real time prediction on MySQL and batch model construction on Hivemall

Hivemall provides a batch learning scheme that builds prediction models on Apache Hadoop. The learning process itself is a batch process; however, online/real-time prediction can be achieved by performing the prediction on a transactional relational DBMS.

In this article, we explain how to run real-time prediction using a relational DBMS. We assume that you have already run the a9a binary classification task.

Online/Offline Matrix of Machine Learning

The following matrix categorizes machine learning schemes by training and prediction style, with example applications.

| Purpose | Online/Real-time Training | Offline/Batch Training |
|---------|---------------------------|------------------------|
| Online/Real-time Prediction | Algorithmic trading (HFT), Twitter real-time analysis | Ad tech (e.g., real-time CTR/CVR prediction), real-time recommendation |
| Offline/Batch Prediction | No/few needs? | Daily/weekly batch systems, business analytics/reporting |

Batch Training + Batch Prediction

Tools like Mahout, R, SAS/SPSS, and Weka provide Batch Training and Batch Prediction schemes. Using them for Online Prediction is not straightforward. Batch schemes are suitable for BI purposes such as generating daily/weekly reports.

Hivemall provides a Batch/Offline Training scheme as well as a Batch Prediction scheme on Apache Hive.

Online Training

A few tools, such as Jubatus and MOA, provide an Online Learning scheme (Online Training and Online Prediction).

The pure Online Learning scheme is beneficial for real-time analysis such as algorithmic trading and Twitter real-time analysis, though it is hard to get a good prediction model without iterative training (note: the pure Online Learning scheme does not perform iterative training).

The remaining combination, Offline Prediction with Online Training, apparently makes little sense.

Hybrid of Batch Training and Online Prediction

The online machine learning algorithms provided in scikit-learn can be used for Batch Training and Online Prediction.

Scikit-learn is a programming library, and its prediction models need to be resident in the memory of the user application. Because of that, users need to consider scalability problems when many requests come from many servers (e.g., predicting ad-click probabilities for each client access).

Building a server/client system that provides standard and convenient APIs is mandatory for business applications.

Amazon Machine Learning provides a real-time prediction scheme that costs $0.0001 per prediction ($0.10 per 1,000 requests) plus an hourly charge for reserved capacity. This cost model is apparently not suitable for frequent predictions (e.g., predicting the CTR for each page view): it is normal for ad-tech vendors to handle billions of requests (i.e., predictions) per day, and at $0.0001 per prediction, one billion predictions would cost $100,000 per day.

Alternatively, Hivemall provides a solution for online prediction: exporting prediction models to a vanilla RDBMS. You then get the benefits of RDBMSs, including standard SQL access usable from almost all programming languages, quick responses through (B-tree) indexes, and the reliability and scalability to handle many client requests.

Prerequisites

  • MySQL

Put mysql-connector-java.jar (the JDBC driver) in $SQOOP_HOME/lib.

  • Sqoop

Sqoop 1.4.5 does not support Hadoop v2.6.0, so you need to build the package for Hadoop 2.6 yourself. To do that, edit build.xml and ivy.xml as shown in this patch.

Preparing Model Tables on MySQL

create database a9a;
use a9a;

create user sqoop identified by 'sqoop';
grant all privileges on a9a.* to 'sqoop'@'%' identified by 'sqoop';
flush privileges;

create table a9a_model1 (
  feature int, 
  weight double
);

exit;
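Note: the GRANT ... IDENTIFIED BY syntax above works on MySQL 5.x but was removed in MySQL 8.0. On 8.0, create the user first and then grant privileges separately:

create user if not exists 'sqoop'@'%' identified by 'sqoop';
grant all privileges on a9a.* to 'sqoop'@'%';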

Do not forget to edit bind_address in the MySQL configuration file (/etc/mysql/my.cnf) so that the server is accessible from the master and slave nodes of the Hadoop cluster.
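A minimal sketch of the relevant change (binding to 0.0.0.0 below opens the server to all interfaces; in practice, restrict it to your cluster network):

[mysqld]
bind-address = 0.0.0.0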

Exporting Hive table to MySQL

Check the connectivity to MySQL server using Sqoop.

export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/
export HADOOP_COMMON_HOME=${HADOOP_HOME}

bin/sqoop list-tables --connect jdbc:mysql://localhost/a9a --username sqoop --password sqoop

Create a TSV table, because Sqoop cannot directly read Hive tables.

create table a9a_model1_tsv 
  ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY "\t"
    LINES TERMINATED BY "\n"
  STORED AS TEXTFILE
AS
select * from a9a_model1;

Check the location of 'a9a_model1_tsv' as follows:

desc extended a9a_model1_tsv;
> location:hdfs://dm01:9000/user/hive/warehouse/a9a.db/a9a_model1_tsv
Then, run Sqoop to export the model table from HDFS to MySQL.

export MYSQL_HOST=dm01

bin/sqoop export \
--connect jdbc:mysql://${MYSQL_HOST}/a9a \
--username sqoop --password sqoop \
--table a9a_model1 \
--export-dir /user/hive/warehouse/a9a.db/a9a_model1_tsv \
--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' \
--batch

When the export finishes successfully, you can find the entries in the model table in MySQL.

mysql> select * from a9a_model1 limit 3;
+---------+---------------------+
| feature | weight              |
+---------+---------------------+
|       0 | -0.5761121511459351 |
|       1 | -1.5259535312652588 |
|      10 | 0.21053194999694824 |
+---------+---------------------+
3 rows in set (0.00 sec)

We recommend creating an index on the model table to speed up lookups in online prediction.

CREATE UNIQUE INDEX a9a_model1_feature_index on a9a_model1 (feature);
-- USING BTREE;
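To confirm that point lookups use the index, you can inspect the query plan (the exact EXPLAIN output depends on your MySQL version):

EXPLAIN select weight from a9a_model1 where feature = 10;
-- the plan should show a lookup via a9a_model1_feature_index rather than a full table scan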

Exporting test data from Hadoop to MySQL (optional step)

Prepare a test data table in Hive to be exported.

create table a9atest_exploded_tsv
  ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY "\t"
    LINES TERMINATED BY "\n"
  STORED AS TEXTFILE
AS
select
  rowid, 
  -- label, 
  extract_feature(feature) as feature,
  extract_weight(feature) as value
from
  a9atest LATERAL VIEW explode(addBias(features)) t AS feature;

desc extended a9atest_exploded_tsv;
> location:hdfs://dm01:9000/user/hive/warehouse/a9a.db/a9atest_exploded_tsv,

Next, prepare a test table on MySQL into which the data is imported.

use a9a;

create table a9atest_exploded (
  rowid bigint,
  feature int, 
  value double
);

Then, run Sqoop to export data from HDFS to MySQL.

export MYSQL_HOST=dm01

bin/sqoop export \
--connect jdbc:mysql://${MYSQL_HOST}/a9a \
--username sqoop --password sqoop \
--table a9atest_exploded \
--export-dir /user/hive/warehouse/a9a.db/a9atest_exploded_tsv \
--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' \
--batch

It is better to add an index on the rowid column to speed up selection by rowid.

CREATE INDEX a9atest_exploded_rowid_index on a9atest_exploded (rowid) USING BTREE;

When the export finishes successfully, you can find the entries in the test table in MySQL.

mysql> select * from a9atest_exploded limit 10;
+-------+---------+-------+
| rowid | feature | value |
+-------+---------+-------+
| 12427 |      67 |     1 |
| 12427 |      73 |     1 |
| 12427 |      74 |     1 |
| 12427 |      76 |     1 |
| 12427 |      82 |     1 |
| 12427 |      83 |     1 |
| 12427 |       0 |     1 |
| 12428 |       5 |     1 |
| 12428 |       7 |     1 |
| 12428 |      16 |     1 |
+-------+---------+-------+
10 rows in set (0.00 sec)

Online/real-time prediction on MySQL

Define a sigmoid function sigmoid(x) = 1 / (1 + e^(-x)), which is used for prediction with logistic regression, as follows:

DROP FUNCTION IF EXISTS sigmoid;
DELIMITER $$
CREATE FUNCTION sigmoid(x DOUBLE)
  RETURNS DOUBLE
  LANGUAGE SQL
BEGIN
  RETURN 1.0 / (1.0 + EXP(-x));
END;
$$
DELIMITER ;
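A quick sanity check of the function (sigmoid(0) is exactly 0.5):

select sigmoid(0);   -- 0.5
select sigmoid(2.0); -- approximately 0.8808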

We assume here that we are making a prediction for a feature vector containing the features (0, 1, 10), each of which is a categorical feature (i.e., its value is 1.0). Then, you can get the probability by logistic regression simply as follows:

select
  sigmoid(sum(m.weight)) as prob
from
  a9a_model1 m
where
  m.feature in (0,1,10);
+--------------------+
| prob               |
+--------------------+
| 0.1310696931351625 |
+--------------------+
1 row in set (0.00 sec)
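As a sanity check, this agrees with the three weights shown earlier: sigmoid(-0.5761 + (-1.5260) + 0.2105) = sigmoid(-1.8915) ≈ 0.1311.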

Similar to the way it is done in Hive, you can run the prediction as follows:

select
  sigmoid(sum(t.value * m.weight)) as prob, 
  if(sigmoid(sum(t.value * m.weight)) > 0.5, 1.0, 0.0) as predicted
from
  a9atest_exploded t LEFT OUTER JOIN
  a9a_model1 m ON (t.feature = m.feature)
where
  t.rowid = 12427; -- prediction on a particular id

Alternatively, you can use an SQL view for the test target 't' (a sketch follows the result below).

+---------------------+-----------+
| prob                | predicted |
+---------------------+-----------+
| 0.05595205126313402 |       0.0 |
+---------------------+-----------+
1 row in set (0.00 sec)
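A minimal sketch of such a view (the view name a9a_predict below is an assumption for illustration):

create view a9a_predict as
select
  t.rowid,
  sigmoid(sum(t.value * m.weight)) as prob,
  if(sigmoid(sum(t.value * m.weight)) > 0.5, 1.0, 0.0) as predicted
from
  a9atest_exploded t LEFT OUTER JOIN
  a9a_model1 m ON (t.feature = m.feature)
group by
  t.rowid;

With the view in place, a prediction for a particular id becomes a simple lookup:

select prob, predicted from a9a_predict where rowid = 12427;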