{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# JustML – Client library usage example\n",
"\n",
"The following example demonstrates the usage of the JustML Python client library. JustML provides **automatic machine learning model selection, training and deployment in the cloud**.\n",
"\n",
"JustML finds the right scikit-learn or xgboost estimator for a given supervised machine learning problem, along with the optimal hyperparameters. It also selects data and feature preprocessors in order to build a complete machine learning pipeline. The selected models are fitted and deployed on JustML computing infrastructure and can be used to generate new predictions. For more information, check https://justml.io.\n",
"\n",
"To reproduce this example, if you haven't done so yet, you will need to [request your JustML API key here](https://justml.io/#getstarted).\n",
"\n",
"The JustML Python library is installed with `pip install justml`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import justml\n",
"\n",
"justml.api_key = \"key-xxxxxxxxx\"\n",
"justml.activate_logging()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example: Select a classifier and fit it to given arrays"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new classifier, or retrieve existing one, named \"classifier1\":"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"uninitialized\n"
]
}
],
"source": [
"clf = justml.Classifier(name=\"classifier1\")\n",
"print(clf.status)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the classifier exists and is already fitted, its status is equal to \"trained\". It's not the case here, and we fit it using the digits dataset from scikit-learn, which we split into a train and a test datasets.\n",
"\n",
"Fit() tests scikit-learn and xgboost machine learning classifiers and chooses the one that performs best on the data provided. The selected model is fitted and deployed in the cloud.\n",
"\n",
"Training data is used to build the estimator model and then is immediately and permanently deleted."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_digits\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X, y = load_digits(return_X_y=True)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:justml:Sending data to train estimator classifier1...\n",
"INFO:justml:Waiting for training to complete on JustML servers...\n",
"INFO:justml:Training is finished, and data has been deleted.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"3min 17s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
]
}
],
"source": [
"%%timeit -n1 -r1 # measure the execution time\n",
"\n",
"if clf.status != \"trained\":\n",
" clf.fit(X=X_train, y=y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It took 3min 17s to select and build a machine learning pipeline.\n",
"\n",
"The show_pipeline() method reveals the pipeline that was built during fit. In this case, two data preprocessors were selected: one hot encoding, and imputation using the mean value. For feature preprocessing, a select_rates method was chosen. And finally, the classifer that performed best is QDA (sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis). All hyperparameters are displayed along with the selected model."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"data_preprocessor\": [\n",
" {\n",
" \"step\": \"categorical_encoding\",\n",
" \"class\": \"one_hot_encoding\",\n",
" \"args\": {\n",
" \"use_minimum_fraction\": false\n",
" }\n",
" },\n",
" {\n",
" \"step\": \"imputation\",\n",
" \"args\": {\n",
" \"strategy\": \"mean\"\n",
" }\n",
" },\n",
" {\n",
" \"step\": \"rescaling\",\n",
" \"class\": \"none\"\n",
" },\n",
" {\n",
" \"step\": \"balancing\",\n",
" \"args\": {\n",
" \"strategy\": \"none\"\n",
" }\n",
" }\n",
" ],\n",
" \"feature_preprocessor\": {\n",
" \"class\": \"select_rates\",\n",
" \"args\": {\n",
" \"alpha\": 0.06544340428506021,\n",
" \"mode\": \"fwe\",\n",
" \"score_func\": \"f_classif\"\n",
" }\n",
" },\n",
" \"classifier\": {\n",
" \"class\": \"sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis\",\n",
" \"args\": {\n",
" \"reg_param\": 0.6396026761675004\n",
" }\n",
" }\n",
"}\n"
]
}
],
"source": [
"clf.show_pipeline()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The show_automl_info() method is useful to see information and statistics about the model selection.\n",
"\n",
"It displays the best validation score found, the number of target algorithms that were selected to be tested, the number of successful runs, and the number of runs that didn't succeed due to memory or runtime exceeding their limits. It is possible to configure and increase memory and runtime limits to fit each problem/dataset needs (contact JustML at hello@justml.io).\n",
"\n",
"In this case, 12 algorithms were successfully tested and compared."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"Best validation score\": \"0.991011\",\n",
" \"Metric\": \"accuracy\",\n",
" \"Number of crashed target algorithm runs\": \"1\",\n",
" \"Number of successful target algorithm runs\": \"12\",\n",
" \"Number of target algorithm runs\": \"15\",\n",
" \"Number of target algorithms that exceeded the memory limit\": \"2\",\n",
" \"Number of target algorithms that exceeded the time limit\": \"0\"\n",
"}\n"
]
}
],
"source": [
"clf.show_automl_info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the pipeline was selected, fitted to the training data, and deployed, we can use it to classify X_test:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.38 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
]
}
],
"source": [
"%%timeit -n1 -r1 # measure the execution time\n",
"\n",
"predictions = clf.predict(X=X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predict took 2.38 seconds.\n",
"\n",
"Like for training, prediction data does not get stored. As soon as the results are computed, data provided as input is permanently deleted.\n",
"\n",
"Let's compare the results with y_test, the true y values for the test dataset:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 1.00 0.98 0.99 53\n",
" 1 1.00 1.00 1.00 42\n",
" 2 1.00 0.98 0.99 41\n",
" 3 1.00 1.00 1.00 52\n",
" 4 0.98 1.00 0.99 47\n",
" 5 1.00 1.00 1.00 39\n",
" 6 1.00 1.00 1.00 43\n",
" 7 1.00 1.00 1.00 48\n",
" 8 0.97 1.00 0.99 37\n",
" 9 1.00 1.00 1.00 48\n",
"\n",
"avg / total 1.00 1.00 1.00 450\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
"print(classification_report(y_test, predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at the pipeline again and rebuild using the corresponding sklearn classes and functions:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"data_preprocessor\": [\n",
" {\n",
" \"step\": \"categorical_encoding\",\n",
" \"class\": \"one_hot_encoding\",\n",
" \"args\": {\n",
" \"use_minimum_fraction\": false\n",
" }\n",
" },\n",
" {\n",
" \"step\": \"imputation\",\n",
" \"args\": {\n",
" \"strategy\": \"mean\"\n",
" }\n",
" },\n",
" {\n",
" \"step\": \"rescaling\",\n",
" \"class\": \"none\"\n",
" },\n",
" {\n",
" \"step\": \"balancing\",\n",
" \"args\": {\n",
" \"strategy\": \"none\"\n",
" }\n",
" }\n",
" ],\n",
" \"feature_preprocessor\": {\n",
" \"class\": \"select_rates\",\n",
" \"args\": {\n",
" \"alpha\": 0.06544340428506021,\n",
" \"mode\": \"fwe\",\n",
" \"score_func\": \"f_classif\"\n",
" }\n",
" },\n",
" \"classifier\": {\n",
" \"class\": \"sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis\",\n",
" \"args\": {\n",
" \"reg_param\": 0.6396026761675004\n",
" }\n",
" }\n",
"}\n"
]
}
],
"source": [
"clf.show_pipeline()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's build and fit the pipeline, step by step:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"QuadraticDiscriminantAnalysis(priors=None, reg_param=0.6396026761675004,\n",
" store_covariance=False, store_covariances=None, tol=0.0001)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sklearn\n",
"import sklearn.discriminant_analysis\n",
"\n",
"# One hot encoding\n",
"onehotencoder = sklearn.preprocessing.OneHotEncoder(categorical_features=[], sparse=False)\n",
"X_train_preprocessed = onehotencoder.fit_transform(X_train)\n",
"\n",
"# Imputation using the mean\n",
"imputer = sklearn.preprocessing.Imputer(strategy=\"mean\")\n",
"X_train_preprocessed = imputer.fit_transform(X_train_preprocessed)\n",
"\n",
"# Feature preprocessing (select_rates corresponds to sklearn.feature_selection.GenericUnivariateSelect)\n",
"feature_preprocessor = sklearn.feature_selection.GenericUnivariateSelect(param=0.06544340428506021, mode=\"fwe\", score_func=sklearn.feature_selection.f_classif)\n",
"X_train_preprocessed = feature_preprocessor.fit_transform(X_train_preprocessed, y_train)\n",
"\n",
"# Classifier (QDA)\n",
"classifier = sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis(reg_param=0.6396026761675004)\n",
"classifier.fit(X_train_preprocessed, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We know apply the fitted pipeline to X_test, and compare the predicted outputs to y_test:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 1.00 0.98 0.99 53\n",
" 1 1.00 1.00 1.00 42\n",
" 2 1.00 0.98 0.99 41\n",
" 3 1.00 1.00 1.00 52\n",
" 4 0.98 1.00 0.99 47\n",
" 5 1.00 1.00 1.00 39\n",
" 6 1.00 1.00 1.00 43\n",
" 7 1.00 1.00 1.00 48\n",
" 8 0.97 1.00 0.99 37\n",
" 9 1.00 1.00 1.00 48\n",
"\n",
"avg / total 1.00 1.00 1.00 450\n",
"\n"
]
}
],
"source": [
"X_test_preprocessed = onehotencoder.transform(X_test)\n",
"X_test_preprocessed = imputer.transform(X_test_preprocessed)\n",
"X_test_preprocessed = feature_preprocessor.transform(X_test_preprocessed)\n",
"\n",
"predictions = classifier.predict(X_test_preprocessed)\n",
"\n",
"print(classification_report(y_test, predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see we obtain the same results."
]
},
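{
"cell_type": "markdown",
"metadata": {},
"source": [
"To confirm the agreement directly rather than by comparing the two reports by eye, the JustML predictions can be recomputed and compared element-wise. The cell below is only a sketch and re-issues the remote predict() call:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Recompute the remote predictions and compare them to the local pipeline's output\n",
"justml_predictions = clf.predict(X=X_test)\n",
"print(np.array_equal(np.asarray(justml_predictions), np.asarray(predictions)))"
]
},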
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example: Select a classifier and fit it to CSV data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new classifier, or retrieve existing one, named \"classifier2\":"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"uninitialized\n"
]
}
],
"source": [
"clf = justml.Classifier(name=\"classifier2\")\n",
"print(clf.status)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the classifier exists and is already fitted, its status is equal to \"trained\". It's not the case here, and we fit it using a dataset in the form of a CSV file.\n",
"\n",
"The CSV needs to contain both the features (predictors) and the response (outcome variable). It needs to have a first row with column names, where the response column has the name indicated by the col_y argument (\"y\" by default). If there are columns with non-numerical values (text), JustML will consider them as categorical variables. You don't need to encode these columns as integers – JustML can handle this for you."
]
},
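{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data.csv file used below is not included with this notebook. As an illustration only, a file in the expected format (named feature columns plus a response column called \"y\") could be produced as in the following sketch, which uses the digits data as a stand-in for the real dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.datasets import load_digits\n",
"\n",
"# Hypothetical example: write a CSV with named feature columns and a response column \"y\"\n",
"X_all, y_all = load_digits(return_X_y=True)\n",
"df = pd.DataFrame(X_all, columns=[\"pixel_%d\" % i for i in range(X_all.shape[1])])\n",
"df[\"y\"] = y_all\n",
"df.to_csv(\"data.csv\", index=False)"
]
},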
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:justml:Sending data to train estimator classifier2...\n",
"INFO:justml:Waiting for training to complete on JustML servers...\n",
"INFO:justml:Training is finished, and data has been deleted.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"3min 39s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
]
}
],
"source": [
"%%timeit -n1 -r1 # measure the execution time\n",
"\n",
"if clf.status != \"trained\":\n",
" clf.fit(csvpath=\"data.csv\", col_y=\"y\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It took 3min 39s to select and build a machine learning pipeline.\n",
"\n",
"Display the pipeline:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"data_preprocessor\": [\n",
" {\n",
" \"step\": \"categorical_encoding\",\n",
" \"class\": \"no_encoding\"\n",
" },\n",
" {\n",
" \"step\": \"imputation\",\n",
" \"args\": {\n",
" \"strategy\": \"mean\"\n",
" }\n",
" },\n",
" {\n",
" \"step\": \"rescaling\",\n",
" \"class\": \"normalize\"\n",
" },\n",
" {\n",
" \"step\": \"balancing\",\n",
" \"args\": {\n",
" \"strategy\": \"none\"\n",
" }\n",
" }\n",
" ],\n",
" \"feature_preprocessor\": {\n",
" \"class\": \"select_rates\",\n",
" \"args\": {\n",
" \"alpha\": 0.1,\n",
" \"mode\": \"fpr\",\n",
" \"score_func\": \"chi2\"\n",
" }\n",
" },\n",
" \"classifier\": {\n",
" \"class\": \"sklearn.neighbors.KNeighborsClassifier\",\n",
" \"args\": {\n",
" \"n_neighbors\": 4,\n",
" \"p\": 2,\n",
" \"weights\": \"uniform\"\n",
" }\n",
" }\n",
"}\n"
]
}
],
"source": [
"clf.show_pipeline()"
]
},
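{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a point of reference, this pipeline can be approximated locally with scikit-learn. The sketch below rests on assumptions: it supposes data.csv holds non-negative numeric features plus a \"y\" column, maps \"normalize\" to sklearn.preprocessing.Normalizer and \"select_rates\" to sklearn.feature_selection.GenericUnivariateSelect, and reuses the hyperparameters reported above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import sklearn.preprocessing\n",
"import sklearn.feature_selection\n",
"import sklearn.neighbors\n",
"\n",
"# Load the (hypothetical) CSV and split off the response column\n",
"df = pd.read_csv(\"data.csv\")\n",
"y_csv = df.pop(\"y\").values\n",
"X_csv = df.values\n",
"\n",
"# Rescaling: \"normalize\" scales each row to unit norm (no categorical encoding or imputation needed here)\n",
"X_csv_preprocessed = sklearn.preprocessing.Normalizer().fit_transform(X_csv)\n",
"\n",
"# Feature preprocessing: select_rates with chi2 / fpr / alpha=0.1 (chi2 requires non-negative features)\n",
"selector = sklearn.feature_selection.GenericUnivariateSelect(score_func=sklearn.feature_selection.chi2, mode=\"fpr\", param=0.1)\n",
"X_csv_preprocessed = selector.fit_transform(X_csv_preprocessed, y_csv)\n",
"\n",
"# Classifier: KNeighborsClassifier with the reported hyperparameters\n",
"knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=4, p=2, weights=\"uniform\")\n",
"knn.fit(X_csv_preprocessed, y_csv)"
]
},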
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show model selection statistics:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"Best validation score\": \"0.986532\",\n",
" \"Metric\": \"accuracy\",\n",
" \"Number of crashed target algorithm runs\": \"1\",\n",
" \"Number of successful target algorithm runs\": \"14\",\n",
" \"Number of target algorithm runs\": \"17\",\n",
" \"Number of target algorithms that exceeded the memory limit\": \"2\",\n",
" \"Number of target algorithms that exceeded the time limit\": \"0\"\n",
"}\n"
]
}
],
"source": [
"clf.show_automl_info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Output the pipeline's predicted values:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0, 1, 2, 3, 4, 9, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n",
"6.76 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
]
}
],
"source": [
"%%timeit -n1 -r1 # measure the execution time\n",
"\n",
"predictions = clf.predict(csvpath=\"data.csv\", col_y=\"y\")\n",
"print(predictions[:30])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predict took 6.76 seconds."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example: Select a regressor and fit it to given arrays"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This example will show how to select and build a regressor. The steps are pretty much the same as those used to fit a classifier.\n",
"\n",
"But instead of using justml.Classifier, we now use the justml.Regressor class.\n",
"\n",
"Create a new regressor, or retrive existing one, named \"regressor1\":"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"uninitialized\n"
]
}
],
"source": [
"reg = justml.Regressor(name=\"regressor1\")\n",
"print(reg.status)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the regressor exists and is already fitted, its status is equal to \"trained\". It's not the case here, and we fit it using the boston dataset from scikit-learn, which we split into a train and a test datasets."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:justml:Sending data to train estimator regressor1...\n",
"INFO:justml:Waiting for training to complete on JustML servers...\n",
"INFO:justml:Training is finished, and data has been deleted.\n"
]
}
],
"source": [
"from sklearn.datasets import load_boston\n",
"\n",
"X, y = load_boston(return_X_y=True)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
"\n",
"if reg.status != \"trained\":\n",
" reg.fit(X=X_train, y=y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the machine learning pipeline that was built during fit:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"data_preprocessor\": [\n",
" {\n",
" \"step\": \"categorical_encoding\",\n",
" \"class\": \"no_encoding\"\n",
" },\n",
" {\n",
" \"step\": \"imputation\",\n",
" \"args\": {\n",
" \"strategy\": \"mean\"\n",
" }\n",
" },\n",
" {\n",
" \"step\": \"rescaling\",\n",
" \"class\": \"quantile_transformer\",\n",
" \"args\": {\n",
" \"n_quantiles\": 42152,\n",
" \"output_distribution\": \"normal\"\n",
" }\n",
" }\n",
" ],\n",
" \"feature_preprocessor\": {\n",
" \"class\": \"no_preprocessing\"\n",
" },\n",
" \"regressor\": {\n",
" \"class\": \"sklearn.ensemble.GradientBoostingRegressor\",\n",
" \"args\": {\n",
" \"alpha\": 0.9575021330927016,\n",
" \"learning_rate\": 0.1616604426098248,\n",
" \"loss\": \"huber\",\n",
" \"max_depth\": 4,\n",
" \"max_features\": 0.15922214934134588,\n",
" \"max_leaf_nodes\": \"None\",\n",
" \"min_impurity_decrease\": 0.0,\n",
" \"min_samples_leaf\": 8,\n",
" \"min_samples_split\": 20,\n",
" \"min_weight_fraction_leaf\": 0.0,\n",
" \"n_estimators\": 213,\n",
" \"subsample\": 0.6969886475405643\n",
" }\n",
" }\n",
"}\n"
]
}
],
"source": [
"reg.show_pipeline()"
]
},
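{
"cell_type": "markdown",
"metadata": {},
"source": [
"As with the classifier example, this pipeline can be approximated locally. The sketch below maps \"quantile_transformer\" to sklearn.preprocessing.QuantileTransformer and reuses the reported hyperparameters; the mean-imputation step is skipped because the Boston data has no missing values, and recent scikit-learn versions clip n_quantiles to the number of training samples when it is larger:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sklearn.preprocessing\n",
"import sklearn.ensemble\n",
"\n",
"# Rescaling: quantile transformation to a normal output distribution\n",
"rescaler = sklearn.preprocessing.QuantileTransformer(n_quantiles=42152, output_distribution=\"normal\")\n",
"X_train_rescaled = rescaler.fit_transform(X_train)\n",
"\n",
"# Regressor: GradientBoostingRegressor with the reported hyperparameters\n",
"regressor = sklearn.ensemble.GradientBoostingRegressor(\n",
"    alpha=0.9575021330927016, learning_rate=0.1616604426098248, loss=\"huber\",\n",
"    max_depth=4, max_features=0.15922214934134588, max_leaf_nodes=None,\n",
"    min_impurity_decrease=0.0, min_samples_leaf=8, min_samples_split=20,\n",
"    min_weight_fraction_leaf=0.0, n_estimators=213, subsample=0.6969886475405643)\n",
"regressor.fit(X_train_rescaled, y_train)"
]
},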
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show model selection statistics:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"Best validation score\": \"0.892810\",\n",
" \"Metric\": \"r2\",\n",
" \"Number of crashed target algorithm runs\": \"2\",\n",
" \"Number of successful target algorithm runs\": \"111\",\n",
" \"Number of target algorithm runs\": \"113\",\n",
" \"Number of target algorithms that exceeded the memory limit\": \"0\",\n",
" \"Number of target algorithms that exceeded the time limit\": \"0\"\n",
"}\n"
]
}
],
"source": [
"reg.show_automl_info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the regressor to predict the outcome of X_test, and compare the results with y_test:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R2: 0.805831\n",
"MSE: 15.309361\n",
"Explained variance: 0.805832\n"
]
}
],
"source": [
"predictions = reg.predict(X=X_test)\n",
"\n",
"from sklearn.metrics import r2_score, mean_squared_error, explained_variance_score\n",
"print('R2: %.6f' % r2_score(y_test, predictions))\n",
"print('MSE: %.6f' % mean_squared_error(y_test, predictions))\n",
"print('Explained variance: %.6f' % explained_variance_score(y_test, predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's it! If you have any questions, comments or suggestions, drop us a line at hello@justml.io."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}