Hivemall provides a batch learning scheme that builds prediction models on Apache Hadoop. The learning process itself is a batch process; however, an online/real-time prediction can be achieved by carrying a prediction on a transactional relational DBMS.
In this article, we explain how to run a real-time prediction using a relational DBMS. We assume that you have already run the a9a binary classification task.
The following table shows the type matrix of machine learning schemes and applications.
Purpose | Online/Real-time Training | Offline/Batch Training |
---|---|---|
Online/Real-time Prediction |
|
|
Offline/Batch Prediction | No/few needs? |
|
Tools like Mahout, R, SAS/SPSS, and Weka provides Batch Training and Batch Prediction schemes. Using them for Online Prediction is not straightforward. The Batch schemes are suitable for making BI purposes generating daily/weekly reports.
Hivemall provides a Batch/Offline Training scheme as well as a Batch Prediction scheme on Apache Hive.
Few tools such as Jubatus and Moa provides Online Learning (Online Training and Online Prediction) scheme.
The pure Online Learning scheme is beneficial for real-time analysis such as algorithm training and twitter real-time analysis though it is hard to get a good prediction model without iterative training (Note: Online Learning scheme does not perform iterative training).
Apparently, the combination of Offline Prediction and Online Training does not make sense.
Online machine algorithms provides in Scikit-learn can be used for Batch Training and Online Prediction.
Scikit-learn is a programming library and the prediction models need to be resident in memory that a user application runs. Because of that, users need to consider scalability problems where a lot of request come from many servers (e.g., predicting Ad-click probabilities for each client access).
Building Server/Client systems that provide a standard and convenient APIs is mandatory for business applications.
Amazon Machine Learning provides real-time prediction scheme and it cost $0.0001 for each prediction ($0.1 for 1,000 requests) + $instances/hour. The cost model is not apparently suitable for frequent predictions (e.g., predicting CTR for each page view). It is normal for adtech vendors to handle more than billions of requests (i.e., predictions) for each day.
Alternatively, Hivemall provides a solution for online prediction: Exporting prediction models to a vanilla RDBMS. You can take the benefits of RDBMSs including a standard SQL access that can be used from almost all programming languages, quick responses using (B-Tree) indexes, reliability and scalability to handle many client requests.
- MySQL
Put mysql-connector-java.jar (JDBC driver) on $SQOOP_HOME/lib.
- Sqoop
Sqoop 1.4.5 does not support Hadoop v2.6.0. So, you need to build packages for Hadoop 2.6. To do that you need to edit build.xml and ivy.xml as shown in this patch.
create database a9a;
use a9a;
create user sqoop identified by 'sqoop';
grant all privileges on a9a.* to 'sqoop'@'%' identified by 'sqoop';
flush privileges;
create table a9a_model1 (
feature int,
weight double
);
exit;
Do not forget to edit bind_address in the MySQL configuration file (/etc/mysql/my.conf) accessible from master and slave nodes of Hadoop.
Check the connectivity to MySQL server using Sqoop.
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/
export HADOOP_COMMON_HOME=${HADOOP_HOME}
bin/sqoop list-tables --connect jdbc:mysql://localhost/a9a --username sqoop --password sqoop
Create TSV table because Sqoop cannot directory read Hive tables.
create table a9a_model1_tsv
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE
AS
select * from a9a_model1;
Check the location of 'a9a_model1_tsv' as follows:
desc extended a9a_model1_tsv;
> location:hdfs://dm01:9000/user/hive/warehouse/a9a.db/a9a_model1_tsv
export MYSQL_HOST=dm01
bin/sqoop export \
--connect jdbc:mysql://${MYSQL_HOST}/a9a \
--username sqoop --password sqoop \
--table a9a_model1 \
--export-dir /user/hive/warehouse/a9a.db/a9a_model1_tsv \
--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' \
--batch
When the exporting successfully finishes, you can find entries in the model table in MySQL.
mysql> select * from a9a_model1 limit 3;
+---------+---------------------+
| feature | weight |
+---------+---------------------+
| 0 | -0.5761121511459351 |
| 1 | -1.5259535312652588 |
| 10 | 0.21053194999694824 |
+---------+---------------------+
3 rows in set (0.00 sec)
We recommend to create an index of model tables to boost lookups in online prediction.
CREATE UNIQUE INDEX a9a_model1_feature_index on a9a_model1 (feature);
-- USING BTREE;
Prepare a testing data table in Hive which is being exported.
create table a9atest_exploded_tsv
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE
AS
select
rowid,
-- label,
extract_feature(feature) as feature,
extract_weight(feature) as value
from
a9atest LATERAL VIEW explode(addBias(features)) t AS feature;
desc extended a9atest_exploded_tsv;
> location:hdfs://dm01:9000/user/hive/warehouse/a9a.db/a9atest_exploded_tsv,
Prepare a test table, importing data from Hadoop.
use a9a;
create table a9atest_exploded (
rowid bigint,
feature int,
value double
);
Then, run Sqoop to export data from HDFS to MySQL.
export MYSQL_HOST=dm01
bin/sqoop export \
--connect jdbc:mysql://${MYSQL_HOST}/a9a \
--username sqoop --password sqoop \
--table a9atest_exploded \
--export-dir /user/hive/warehouse/a9a.db/a9atest_exploded_tsv \
--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' \
--batch
Better to add an index to the rowid column to boost selection by rowids.
CREATE INDEX a9atest_exploded_rowid_index on a9atest_exploded (rowid) USING BTREE;
When the exporting successfully finishes, you can find entries in the test table in MySQL.
mysql> select * from a9atest_exploded limit 10;
+-------+---------+-------+
| rowid | feature | value |
+-------+---------+-------+
| 12427 | 67 | 1 |
| 12427 | 73 | 1 |
| 12427 | 74 | 1 |
| 12427 | 76 | 1 |
| 12427 | 82 | 1 |
| 12427 | 83 | 1 |
| 12427 | 0 | 1 |
| 12428 | 5 | 1 |
| 12428 | 7 | 1 |
| 12428 | 16 | 1 |
+-------+---------+-------+
10 rows in set (0.00 sec)
Define sigmoid function used for a prediction of logistic regression as follows:
DROP FUNCTION IF EXISTS sigmoid;
DELIMITER $$
CREATE FUNCTION sigmoid(x DOUBLE)
RETURNS DOUBLE
LANGUAGE SQL
BEGIN
RETURN 1.0 / (1.0 + EXP(-x));
END;
$$
DELIMITER ;
We assume here that doing prediction for a 'features' having (0,1,10) and each of them is a categorical feature (i.e., the weight is 1.0). Then, you can get the probability by logistic regression simply as follows:
select
sigmoid(sum(m.weight)) as prob
from
a9a_model1 m
where
m.feature in (0,1,10);
+--------------------+
| prob |
+--------------------+
| 0.1310696931351625 |
+--------------------+
1 row in set (0.00 sec)
Similar to the way in Hive, you can run prediction as follows:
select
sigmoid(sum(t.value * m.weight)) as prob,
if(sigmoid(sum(t.value * m.weight)) > 0.5, 1.0, 0.0) as predicted
from
a9atest_exploded t LEFT OUTER JOIN
a9a_model1 m ON (t.feature = m.feature)
where
t.rowid = 12427; -- prediction on a particular id
Alternatively, you can use SQL views for testing target 't'.
+---------------------+-----------+
| prob | predicted |
+---------------------+-----------+
| 0.05595205126313402 | 0.0 |
+---------------------+-----------+
1 row in set (0.00 sec)