@amitkumarj441
Last active December 15, 2018 23:33
Spin H2O 🚀
```
ith@ith-ThinkPad-W520:~$ pip install h2o
Collecting h2o
Downloading https://files.pythonhosted.org/packages/6e/e4/1b34202b4887f8187f72acaa178eb4ff87982a9583008c78e1929d8a5e23/h2o-3.22.0.2.tar.gz (120.6MB)
100% |████████████████████████████████| 120.6MB 344kB/s
Requirement already satisfied: requests in ./anaconda3/lib/python3.6/site-packages (from h2o) (2.11.1)
Collecting tabulate (from h2o)
Downloading https://files.pythonhosted.org/packages/12/c2/11d6845db5edf1295bc08b2f488cf5937806586afe42936c3f34c097ebdc/tabulate-0.8.2.tar.gz (45kB)
100% |████████████████████████████████| 51kB 5.3MB/s
Requirement already satisfied: future in ./anaconda3/lib/python3.6/site-packages (from h2o) (0.16.0)
Requirement already satisfied: colorama>=0.3.8 in ./anaconda3/lib/python3.6/site-packages (from h2o) (0.3.9)
Building wheels for collected packages: h2o, tabulate
Running setup.py bdist_wheel for h2o ... done
Stored in directory: /home/ith/.cache/pip/wheels/0d/17/52/9ea300738f719aca7b88a790ce94b8c928e7c6098e72627c7f
Running setup.py bdist_wheel for tabulate ... done
Stored in directory: /home/ith/.cache/pip/wheels/2a/85/33/2f6da85d5f10614cbe5a625eab3b3aebfdf43e7b857f25f829
Successfully built h2o tabulate
Installing collected packages: tabulate, h2o
Successfully installed h2o-3.22.0.2 tabulate-0.8.2
ith@ith-ThinkPad-W520:~$ python
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import h2o
>>> h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "1.8.0_191"; OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12); OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
Starting server from /home/ith/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpj_ti7qqm
JVM stdout: /tmp/tmpj_ti7qqm/h2o_ith_started_from_python.out
JVM stderr: /tmp/tmpj_ti7qqm/h2o_ith_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
-------------------------- ----------------------------------------
H2O cluster uptime: 01 secs
H2O cluster timezone: Asia/Kolkata
H2O data parsing timezone: UTC
H2O cluster version: 3.22.0.2
H2O cluster version age: 23 days
H2O cluster name: H2O_from_python_ith_m9y0r2
H2O cluster total nodes: 1
H2O cluster free memory: 1.707 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy:
H2O internal security: False
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
Python version: 3.6.4 final
-------------------------- ----------------------------------------
>>> h2o.demo("glm")
-------------------------------------------------------------------------------
Demo of H2O's Generalized Linear Estimator.
This demo uploads a dataset to h2o, parses it, and shows a description.
Then it divides the dataset into training and test sets, builds a GLM
from the training set, and makes predictions for the test set.
Finally, default performance metrics are displayed.
-------------------------------------------------------------------------------
>>> # Connect to H2O
>>> h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321. connected.
-------------------------- ----------------------------------------
H2O cluster uptime: 1 min 13 secs
H2O cluster timezone: Asia/Kolkata
H2O data parsing timezone: UTC
H2O cluster version: 3.22.0.2
H2O cluster version age: 23 days
H2O cluster name: H2O_from_python_ith_m9y0r2
H2O cluster total nodes: 1
H2O cluster free memory: 1.699 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy:
H2O internal security: False
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
Python version: 3.6.4 final
-------------------------- ----------------------------------------
>>> # Upload the prostate dataset that comes included in the h2o python package
>>> prostate = h2o.load_dataset("prostate")
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
>>> # Print a description of the prostate data
>>> prostate.describe()
Rows:380
Cols:9
         ID                  CAPSULE             AGE                RACE                DPROS               DCAPS               PSA                 VOL                 GLEASON
-------  ------------------  ------------------  -----------------  ------------------  ------------------  ------------------  ------------------  ------------------  ------------------
type     int                 int                 int                int                 int                 int                 real                real                int
mins     1.0                 0.0                 43.0               0.0                 1.0                 1.0                 0.3                 0.0                 0.0
mean     190.5               0.4026315789473684  66.03947368421049  1.0868421052631572  2.2710526315789488  1.1078947368421048  15.408631578947375  15.812921052631573  6.3842105263157904
maxs     380.0               1.0                 79.0               2.0                 4.0                 2.0                 139.7               97.6                9.0
sigma    109.84079387914127  0.4910743389630552  6.527071269173311  0.3087732580252793  1.0001076181502861  0.3106564493514939  19.99757266856046   18.347619967271175  1.0919533744261092
zeros    0                   227                 0                  3                   0                   0                   0                   167                 2
missing  0                   0                   0                  0                   0                   0                   0                   0                   0
0        1.0                 0.0                 65.0               1.0                 2.0                 1.0                 1.4                 0.0                 6.0
1        2.0                 0.0                 72.0               1.0                 3.0                 2.0                 6.7                 0.0                 7.0
2        3.0                 0.0                 70.0               1.0                 1.0                 2.0                 4.9                 0.0                 6.0
3        4.0                 0.0                 76.0               2.0                 2.0                 1.0                 51.2                20.0                7.0
4        5.0                 0.0                 69.0               1.0                 1.0                 1.0                 12.3                55.9                6.0
5        6.0                 1.0                 71.0               1.0                 3.0                 2.0                 3.3                 0.0                 8.0
6        7.0                 0.0                 68.0               2.0                 4.0                 2.0                 31.9                0.0                 7.0
7        8.0                 0.0                 61.0               2.0                 4.0                 2.0                 66.7                27.2                7.0
8        9.0                 0.0                 69.0               1.0                 1.0                 1.0                 3.9                 24.0                7.0
9        10.0                0.0                 68.0               2.0                 1.0                 2.0                 13.0                0.0                 6.0
>>> # Randomly split the dataset into ~70/30, training/test sets
>>> train, test = prostate.split_frame(ratios=[0.70])
>>> # Convert the response columns to factors (for binary classification problems)
>>> train["CAPSULE"] = train["CAPSULE"].asfactor()
>>> test["CAPSULE"] = test["CAPSULE"].asfactor()
>>> # Build a (classification) GLM
>>> from h2o.estimators import H2OGeneralizedLinearEstimator
>>> prostate_glm = H2OGeneralizedLinearEstimator(family="binomial", alpha=[0.5])
>>> prostate_glm.train(x=["AGE", "RACE", "PSA", "VOL", "GLEASON"],
... y="CAPSULE", training_frame=train)
glm Model Build progress: |███████████████████████████████████████████████████████████████████| 100%
>>> # Show the model
>>> prostate_glm.show()
Model Details
=============
H2OGeneralizedLinearEstimator : Generalized Linear Modeling
Model Key: GLM_model_python_1544916549296_1
ModelMetricsBinomialGLM: glm
** Reported on train data. **
MSE: 0.17549790843172788
RMSE: 0.4189247049670476
LogLoss: 0.5203121074548108
Null degrees of freedom: 256
Residual degrees of freedom: 251
Null deviance: 346.08953451744003
Residual deviance: 267.44042323177274
AIC: 279.44042323177274
AUC: 0.7985752111965704
pr_auc: 0.743468386439608
Gini: 0.5971504223931408
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2834351554069239:
       0    1    Error    Rate
-----  ---  ---  -------  ------------
0      102  52   0.3377   (52.0/154.0)
1      20   83   0.1942   (20.0/103.0)
Total  122  135  0.2802   (72.0/257.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric                       threshold    value     idx
---------------------------  -----------  --------  -----
max f1                       0.283435     0.697479  134
max f2                       0.104783     0.798122  226
max f0point5                 0.548671     0.691906  69
max accuracy                 0.498722     0.743191  88
max precision                0.998717     1         0
max recall                   0.0981563    1         233
max specificity              0.998717     1         0
max absolute_mcc             0.283435     0.45944   134
max min_per_class_accuracy   0.420984     0.718447  116
max mean_per_class_accuracy  0.283435     0.734081  134
Gains/Lift Table: Avg response rate: 40.08 %, avg score: 40.08 %
    group    cumulative_data_fraction    lower_threshold    lift       cumulative_lift    response_rate    score      cumulative_response_rate    cumulative_score    capture_rate    cumulative_capture_rate    gain      cumulative_gain
--  -------  --------------------------  -----------------  ---------  -----------------  ---------------  ---------  --------------------------  ------------------  --------------  -------------------------  --------  -----------------
    1        0.0116732                   0.985441           2.49515    2.49515            1                0.991876   1                           0.991876            0.0291262       0.0291262                  149.515   149.515
    2        0.0233463                   0.958871           2.49515    2.49515            1                0.968557   1                           0.980217            0.0291262       0.0582524                  149.515   149.515
    3        0.0311284                   0.938309           2.49515    2.49515            1                0.947873   1                           0.972131            0.0194175       0.0776699                  149.515   149.515
    4        0.0428016                   0.925298           2.49515    2.49515            1                0.929792   1                           0.960584            0.0291262       0.106796                   149.515   149.515
    5        0.0505837                   0.923301           2.49515    2.49515            1                0.924608   1                           0.955049            0.0194175       0.126214                   149.515   149.515
    6        0.101167                    0.778681           2.30321    2.39918            0.923077         0.875317   0.961538                    0.915183            0.116505        0.242718                   130.321   139.918
    7        0.151751                    0.70117            1.53547    2.11128            0.615385         0.733698   0.846154                    0.854688            0.0776699       0.320388                   53.5474   111.128
    8        0.202335                    0.617197           1.53547    1.96733            0.615385         0.654417   0.788462                    0.80462             0.0776699       0.398058                   53.5474   96.7326
    9        0.299611                    0.533276           1.39728    1.78225            0.56             0.565165   0.714286                    0.726875            0.135922        0.533981                   39.7282   78.2247
    10       0.400778                    0.46393            1.24757    1.64728            0.5              0.498154   0.660194                    0.66914             0.126214        0.660194                   24.7573   64.7281
    11       0.501946                    0.305072           1.15161    1.54738            0.461538         0.401419   0.620155                    0.615181            0.116505        0.776699                   15.1606   54.7377
    12       0.599222                    0.247644           0.598835   1.39339            0.24             0.26779    0.558442                    0.558786            0.0582524       0.834951                   -40.1165  39.3393
    13       0.700389                    0.228719           0.67177    1.28916            0.269231         0.23765    0.516667                    0.5124              0.0679612       0.902913                   -32.823   28.9159
    14       0.797665                    0.190922           0.299417   1.16846            0.12             0.211326   0.468293                    0.475683            0.0291262       0.932039                   -70.0583  16.8458
    15       0.898833                    0.0992135          0.575803   1.10175            0.230769         0.136606   0.441558                    0.437519            0.0582524       0.990291                   -42.4197  10.1753
    16       1                           0.000538306        0.0959671  1                  0.0384615        0.0743495  0.400778                    0.400778            0.00970874      1                          -90.4033  0
Scoring History:
    timestamp            duration    iterations    negative_log_likelihood    objective
--  -------------------  ----------  ------------  -------------------------  -----------
    2018-12-16 05:01:29  0.000 sec   0             173.045                    0.673326
    2018-12-16 05:01:29  0.021 sec   1             137.684                    0.536127
    2018-12-16 05:01:29  0.024 sec   2             133.889                    0.521584
    2018-12-16 05:01:29  0.026 sec   3             133.722                    0.521001
    2018-12-16 05:01:29  0.029 sec   4             133.72                     0.520999
>>> # Predict on the test set and show the first ten predictions
>>> predictions = prostate_glm.predict(test)
>>> predictions.show()
glm prediction progress: |████████████████████████████████████████████████████████████████████| 100%
predict    p0        p1
---------  --------  --------
1          0.495329  0.504671
1          0.35433   0.64567
1          0.16257   0.83743
1          0.570585  0.429415
1          0.368692  0.631308
0          0.742835  0.257165
1          0.505277  0.494723
1          0.198034  0.801966
0          0.767926  0.232074
0          0.73679   0.26321
[123 rows x 3 columns]
>>> # Show default performance metrics
>>> performance = prostate_glm.model_performance(test)
>>> performance.show()
ModelMetricsBinomialGLM: glm
** Reported on test data. **
MSE: 0.1845816027455374
RMSE: 0.4296296111135002
LogLoss: 0.5399893841613397
Null degrees of freedom: 122
Residual degrees of freedom: 117
Null deviance: 166.2047381088705
Residual deviance: 132.83738850368957
AIC: 144.83738850368957
AUC: 0.7964383561643835
pr_auc: 0.7001994986434553
Gini: 0.592876712328767
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.23779643521579233:
       0    1    Error    Rate
-----  ---  ---  -------  ------------
0      40   33   0.4521   (33.0/73.0)
1      4    46   0.08     (4.0/50.0)
Total  44   79   0.3008   (37.0/123.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric                       threshold    value     idx
---------------------------  -----------  --------  -----
max f1                       0.237796     0.713178  78
max f2                       0.180305     0.830508  94
max f0point5                 0.476899     0.68      49
max accuracy                 0.476899     0.739837  49
max precision                0.992726     1         0
max recall                   0.0958891    1         108
max specificity              0.992726     1         0
max absolute_mcc             0.237796     0.479514  78
max min_per_class_accuracy   0.436354     0.739726  55
max mean_per_class_accuracy  0.436354     0.739863  55
Gains/Lift Table: Avg response rate: 40.65 %, avg score: 40.55 %
    group    cumulative_data_fraction    lower_threshold    lift     cumulative_lift    response_rate    score      cumulative_response_rate    cumulative_score    capture_rate    cumulative_capture_rate    gain     cumulative_gain
--  -------  --------------------------  -----------------  -------  -----------------  ---------------  ---------  --------------------------  ------------------  --------------  -------------------------  -------  -----------------
    1        0.0162602                   0.989402           2.46     2.46               1                0.992642   1                           0.992642            0.04            0.04                       146      146
    2        0.0243902                   0.972377           2.46     2.46               1                0.978209   1                           0.987831            0.02            0.06                       146      146
    3        0.0325203                   0.956311           2.46     2.46               1                0.964953   1                           0.982112            0.02            0.08                       146      146
    4        0.0406504                   0.950283           2.46     2.46               1                0.951859   1                           0.976061            0.02            0.1                        146      146
    5        0.0569106                   0.943331           2.46     2.46               1                0.947674   1                           0.967951            0.04            0.14                       146      146
    6        0.105691                    0.833227           1.64     2.08154            0.666667         0.880921   0.846154                    0.927783            0.08            0.22                       64       108.154
    7        0.154472                    0.75037            1.23     1.81263            0.5              0.782443   0.736842                    0.881886            0.06            0.28                       23       81.2632
    8        0.203252                    0.631417           0.82     1.5744             0.333333         0.688214   0.64                        0.835405            0.04            0.32                       -18      57.44
    9        0.300813                    0.528942           1.845    1.66216            0.75             0.571819   0.675676                    0.749918            0.18            0.5                        84.5     66.2162
    10       0.398374                    0.476904           1.64     1.65673            0.666667         0.501955   0.673469                    0.689192            0.16            0.66                       64       65.6735
    11       0.504065                    0.293807           1.13538  1.54742            0.461538         0.406299   0.629032                    0.629876            0.12            0.78                       13.5385  54.7419
    12       0.601626                    0.253514           0.82     1.42946            0.333333         0.266497   0.581081                    0.570949            0.08            0.86                       -18      42.9459
    13       0.699187                    0.227448           0.82     1.34442            0.333333         0.237395   0.546512                    0.524407            0.08            0.94                       -18      34.4419
    14       0.796748                    0.161407           0.41     1.23               0.166667         0.192631   0.5                         0.483781            0.04            0.98                       -59      23
    15       0.894309                    0.0951997          0.205    1.11818            0.0833333        0.119999   0.454545                    0.444096            0.02            1                          -79.5    11.8182
    16       1                           0.0550237          0        1                  0                0.0787136  0.406504                    0.405478            0               1                          -100     0
---- End of Demo ----
`````````````````````````````````````````````````````
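
The training metrics printed by `prostate_glm.show()` in the log above are tied together by a few standard identities, so recomputing them by hand is a quick sanity check on the output. The sketch below is plain Python and does not call H2O. The helper name `derived_metrics` is hypothetical, and the coefficient count of 6 (five predictors plus an intercept) is an assumption inferred from the gap between the 257 training rows and the 251 residual degrees of freedom. The AIC and LogLoss lines assume a 0/1 response, where the deviance equals minus twice the log-likelihood.

```python
import math


def derived_metrics(tn, fp, fn, tp, mse, residual_deviance, auc, n_coefs):
    """Recompute summary metrics from a confusion matrix and a few reported values."""
    n_rows = tn + fp + fn + tp
    positives = tp + fn
    return {
        "precision": tp / (tp + fp),
        "recall": tp / positives,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "error_rate": (fp + fn) / n_rows,
        "rmse": math.sqrt(mse),
        # For a 0/1 response the deviance is -2 * log-likelihood,
        # so LogLoss is the deviance divided by twice the row count.
        "logloss": residual_deviance / (2 * n_rows),
        "aic": residual_deviance + 2 * n_coefs,
        "gini": 2 * auc - 1,
        # Lift of a group whose response rate is 1.0: 1 / overall response rate.
        "top_lift": n_rows / positives,
    }


# Values copied from the "Reported on train data" section of the log.
train = derived_metrics(
    tn=102, fp=52, fn=20, tp=83,
    mse=0.17549790843172788,
    residual_deviance=267.44042323177274,
    auc=0.7985752111965704,
    n_coefs=6,  # assumption: 5 predictors + intercept
)

for name, value in train.items():
    print(f"{name:>10}: {value:.6f}")
```

The recomputed F1, error rate, RMSE, LogLoss, AIC, Gini and top-group lift should agree with the reported 0.697479, 0.2802, 0.4189..., 0.5203..., 279.44..., 0.5971... and 2.49515. Passing the test-set confusion matrix (40, 33, 4, 46) with its own MSE, deviance and AUC checks the second metrics block the same way.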