Skip to content

Instantly share code, notes, and snippets.

@myui
Last active August 29, 2015 14:22
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save myui/c25199a46c1ce2e08604 to your computer and use it in GitHub Desktop.
Save myui/c25199a46c1ce2e08604 to your computer and use it in GitHub Desktop.

First of all, make sure that your Treasure Data cluster is HDP2, not CDH4. Matrix Factorization is only supported in the up-to-date HDP2 cluster. HDP2 is allocated for users who signed Treasure Data after Feb 2015. CDH4 is allcoated for the others.

NOTE: please ask our customer support to use HDP2 if you get an error.

Data preparation

Download ml-20m.zip and unzip it.

Create database and importing raw rating data into TreasureData from CSV.

Note: --time-value is used to add a dummy time column, that is required in Treasure Data, for each row.

td db:create movielens20
td table:create movielens20m ratings

td import:auto --format csv --column-header --time-value `date +%s` --auto-create movielens20m.ratings ./ratings.csv

Preparation for making training/testing data from the original table.

td table:create movielens20m ratings2

td query -w -x \
-d movielens20m \
"INSERT OVERWRITE TABLE ratings2 \
select \
  rand(31) as rnd, \
  userid, \
  movieid, \
  rating \
from \
  ratings"

Made 80% for training and 20% for testing.

td table:create movielens20m training

td query -x \
--type hive \
-d movielens20m \
"INSERT OVERWRITE TABLE training \
select * from ratings2 \
order by rnd DESC \
limit 16000000"

Note: Please set v columns off for ratings2 through Web console to use SELECT * in the INSERT INTO.

td table:create movielens20m testing

td query -x \
--type hive \
-d movielens20m \
"INSERT OVERWRITE TABLE testing \
select * from ratings2 \
order by rnd ASC \
limit 4000263"

Training

Calculate the mean rating in the training dataset.

td query -w \
--type presto \
-d movielens20m \
"select avg(rating) from training"

3.52560165625

The above average value used in the following queries.

Run training of Matrix Factorization.

Note: TD result output (--result) is not recommended for issuing the following query. Use INSERT INTO table statement instead.

td table:create movielens20m sgd_model_f20

td query -x \
--type hive \
-d movielens20m \
"WITH training_amplified as ( \
select \
   amplify(3, cast(userid as int), cast(movieid as int), rating) as (userid, movieid, rating) \
from \
   training \
) \
INSERT OVERWRITE TABLE sgd_model_f20 \
select \
  idx, \
  array_avg(u_rank) as Pu, \
  array_avg(m_rank) as Qi, \
  avg(u_bias) as Bu, \
  avg(m_bias) as Bi \
from ( \
  select train_mf_sgd(cast(userid as int), cast(movieid as int), rating, '-factor 20 -mu 3.52560165625 -iter 50') as (idx, u_rank, m_rank, u_bias, m_bias) \
  from \
  	training_amplified \
) t \
group by idx"

Note: Currently, the signature of train_mf_sgd is train_mf_sgd(int userid, int itemid, numeric rating [, string options]). We will accept any numeric types in the next version (v0.3.2) of Hivemall.

Note: Please avoid ERROR metastore.RetryingHMSHandler: MetaException(message:NoSuchObjectException(message:Function xxxxx.train_mf_sgd does not exist)) in logs. Its not a logging issue (Hive 0.13 bug) and not a problem.

Predict

td table:create movielens20m sgd_predict_f20

td query -x \
--type hive \
-d movielens20m \
"INSERT OVERWRITE TABLE sgd_predict_f20 \
select \
    t2.actual, \
    mf_predict(if(size(t2.Pu)=0,null,t2.Pu),if(size(p2.Qi)=0,null,Qi), t2.Bu, p2.Bi, 3.52560165625) as predicted \
from ( \
    select \
      t1.userid, \
      t1.movieid, \
      t1.rating as actual, \
      p1.Pu, \
      p1.Bu \
    from \
      testing t1 LEFT OUTER JOIN sgd_model_f20 p1 \
      ON (t1.userid = p1.idx) \
) t2 \
LEFT OUTER JOIN sgd_model_f20 p2 \
ON (t2.movieid = p2.idx)"

Note: if(size(xxx)=0,null,xxx) is just a workaround and will be fixed to avoid it.

Evaluate (computes MAE and RMSE)

td query -w \
--type hive \
-d movielens20m \
"select \
  mae(predicted, actual) as mae, \
  rmse(predicted, actual) as rmse \
from \
  sgd_predict_f20"
+--------------------+--------------------+       
| mae                | rmse               |
+--------------------+--------------------+
| 0.6123907679836259 | 0.8027164481776642 |
+--------------------+--------------------+
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment