{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Index\n",
"\n",
"- [Load workspace](#Load-workspace)\n",
"- [Create / connect to an experiment](#Create-/-connect-to-an-experiment)\n",
"- [Upload data files into datastore](#Upload-data-files-into-datastore)\n",
"- [Create training scripts](#Create-training-scripts)\n",
"- [Create / connect to Linux DSVM as a compute target](#Create-/-connect-to-Linux-DSVM-as-a-compute-target)\n",
"- [Configure & Run](#Configure-&-Run)\n",
"- [Display run results](#Display-run-results)\n",
"- [Register model](#Register-model)\n",
"- [Clean up the compute target](#Clean-up-the-compute-target)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Azure ML SDK Version: 0.1.59\n"
]
}
],
"source": [
"import os\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"import azureml\n",
"from azureml.core import Workspace, Run\n",
"\n",
"print(\"Azure ML SDK Version: \", azureml.core.VERSION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load workspace"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found the config file in: /home/nbuser/library/aml_config/config.json\n",
"Xiangzhe-WS\twesteurope\tXiangzhe-ML\twesteurope\n"
]
}
],
"source": [
"# load workspace configuration from the config.json file in the current folder.\n",
"ws = Workspace.from_config()\n",
"print(ws.name, ws.location, ws.resource_group, ws.location, sep = '\\t')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create / connect to an experiment"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# create an experiment\n",
"experiment_name = 'nyc-taxi-dsvm'\n",
"\n",
"from azureml.core import Experiment\n",
"exp = Experiment(workspace = ws, name = experiment_name)"
]
},
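{
"cell_type": "markdown",
"metadata": {},
"source": [
"If an experiment with this name already exists, the constructor simply attaches to it. As a quick sanity check (a sketch, not part of the original workflow), we can list any previous runs of the experiment:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# list previous runs of this experiment, if any (harmless on a fresh experiment)\n",
"for past_run in exp.get_runs():\n",
"    print(past_run.id, past_run.get_status())"
]
},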
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload data files into datastore\n",
"\n",
"Every workspace comes with a default datastore which is backed by the Azure blob storage account associated with the workspace. We can use it to transfer data from local to the cloud, and access it from the compute target (Here, our compute target is DSVM)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AzureFile xiangzhews1068013949 azureml-filestore-bc063c69-64a6-48ce-90f5-33cb3c8d43b2\n"
]
}
],
"source": [
"# get the default datastore\n",
"ds = ws.get_default_datastore()\n",
"print(ds.datastore_type, ds.account_name, ds.container_name)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"$AZUREML_DATAREFERENCE_b214114d38a24588a15b66bc27d0d5df"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# upload data file(s)\n",
"ds.upload_files(['./data_after_prep.pkl'], target_path = 'nyc-taxi', overwrite = True, show_progress = True)\n",
"#ds.upload(src_dir='.', target_path='nyc-taxi', overwrite=True, show_progress=True)"
]
},
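{
"cell_type": "markdown",
"metadata": {},
"source": [
"The uploaded folder can now be referenced through the datastore. As a sketch (assuming the upload above succeeded), `ds.path(...)` builds a `DataReference` of the kind that run configurations consume:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# build a DataReference pointing at the uploaded folder\n",
"data_ref = ds.path('nyc-taxi')\n",
"print(data_ref)"
]
},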
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create training scripts\n",
"\n",
"### Create a script directory\n",
"\n",
"Create a directory to deliver the necessary code from local to the remote compute target."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"script_folder = './scripts_dsvm'\n",
"os.makedirs(script_folder, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create scripts\n",
"\n",
"To submit the job to the cluster, we should create a training script.\n",
"\n",
"_**Note**: The data path settings of DSVM and Batch AI cluster are different. Be careful !!!_"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Writing ./scripts_dsvm/train.py\n"
]
}
],
"source": [
"%%writefile $script_folder/train.py\n",
"\n",
"import os\n",
"import argparse\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from sklearn import preprocessing\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.externals import joblib\n",
"\n",
"from azureml.core import Run\n",
"\n",
"# get hold of the current run\n",
"run = Run.get_submitted_run()\n",
"\n",
"# parse arguments\n",
"parser = argparse.ArgumentParser()\n",
"parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')\n",
"args = parser.parse_args()\n",
"\n",
"data_folder = args.data_folder\n",
"data_path = os.path.join(data_folder, 'data_after_prep.pkl')\n",
"run.log('Data path', data_path)\n",
"\n",
"# load data\n",
"pd_dataframe = pd.read_pickle(data_path)\n",
"run.log('Data loading', 'finished')\n",
"\n",
"# data processing\n",
"le = preprocessing.LabelEncoder()\n",
"le.fit([\"N\", \"Y\"])\n",
"pd_dataframe[\"store_and_fwd_flag\"] = le.transform(pd_dataframe[\"store_and_fwd_flag\"])\n",
"\n",
"le.fit([\"Monday\", \"Tuesday\", \"Wednesday\", \"Thursday\", \"Friday\", \"Saturday\", \"Sunday\"])\n",
"pd_dataframe[\"pickup_weekday\"] = le.transform(pd_dataframe[\"pickup_weekday\"])\n",
"pd_dataframe[\"dropoff_weekday\"] = le.transform(pd_dataframe[\"dropoff_weekday\"])\n",
"run.log('Data processing', 'finished')\n",
"\n",
"# load dataset into numpy arrays\n",
"y = np.array(pd_dataframe[\"trip_duration\"]).astype(float)\n",
"y = np.log(y)\n",
"X = np.array(pd_dataframe.drop([\"trip_duration\"],axis = 1))\n",
"\n",
"# normalize data\n",
"scaler = preprocessing.StandardScaler().fit(X)\n",
"X = scaler.transform(X)\n",
"run.log('Normalization', 'finished')\n",
"\n",
"# split data into train and validation datasets\n",
"X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.25, random_state = 20)\n",
"\n",
"# train LR model\n",
"lm = LinearRegression()\n",
"lm.fit(X_train, y_train)\n",
"run.log('Model training', 'finished')\n",
"\n",
"y_pred = lm.predict(X_val)\n",
"run.log('Prediction', 'finished')\n",
"\n",
"# evaluation\n",
"mse = mean_squared_error(y_val, y_pred)\n",
"run.log('Evaluation', 'finished')\n",
"run.log('Mean Squared Error', np.float(mse))\n",
"\n",
"os.makedirs('outputs', exist_ok=True)\n",
"# note!!! file saved in the outputs folder is automatically uploaded into experiment record\n",
"joblib.dump(value=lm, filename='outputs/nyc_taxi_model.pkl')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create / connect to Linux DSVM as a compute target"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"found existing: mydsvm\n"
]
}
],
"source": [
"from azureml.core.compute import DsvmCompute\n",
"from azureml.core.compute_target import ComputeTargetException\n",
"\n",
"compute_target_name = 'mydsvm'\n",
"\n",
"try:\n",
" dsvm_compute = DsvmCompute(workspace=ws, name=compute_target_name)\n",
" print('found existing:', dsvm_compute.name)\n",
"except ComputeTargetException:\n",
" print('creating new.')\n",
" dsvm_config = DsvmCompute.provisioning_configuration(vm_size=\"Standard_D2_v2\")\n",
" dsvm_compute = DsvmCompute.create(ws, name=compute_target_name, provisioning_configuration=dsvm_config)\n",
" dsvm_compute.wait_for_completion(show_output=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure & Run\n",
"\n",
"Firstly, create a DataReferenceConfiguration object to inform the system what data folder to download to the copmute target."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import DataReferenceConfiguration\n",
"dr = DataReferenceConfiguration(datastore_name=ds.name, \n",
" path_on_datastore='nyc-taxi', \n",
" mode='download', # download files from datastore to compute target\n",
" overwrite=True)"
]
},
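{
"cell_type": "markdown",
"metadata": {},
"source": [
"For larger datasets, `mode='mount'` is an alternative to `'download'`: the files are streamed from the datastore on demand instead of being copied up front. A sketch of the same configuration with mounting (`dr_mount` is a hypothetical name):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# alternative: mount the datastore instead of downloading it\n",
"dr_mount = DataReferenceConfiguration(datastore_name=ds.name,\n",
"                   path_on_datastore='nyc-taxi',\n",
"                   mode='mount') # files are fetched on demand"
]
},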
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Secondly, ask the system to build a conda environment based on the dependency specification, and submit the script to run there. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.runconfig import RunConfiguration\n",
"from azureml.core.conda_dependencies import CondaDependencies\n",
"\n",
"# create a new RunConfig object\n",
"conda_run_config = RunConfiguration(framework=\"python\")\n",
"\n",
"# Set compute target to the Linux DSVM\n",
"conda_run_config.target = dsvm_compute.name\n",
"\n",
"# set the data reference of the run configuration\n",
"conda_run_config.data_references = {ds.name: dr}\n",
"\n",
"# specify CondaDependencies obj\n",
"conda_run_config.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['numpy','pandas','scikit-learn'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thirdly, run the script. Once the environment is built, and if we don't change our dependencies, it will be reused in subsequent runs."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Run\n",
"from azureml.core import ScriptRunConfig\n",
"\n",
"src = ScriptRunConfig(source_directory=script_folder, \n",
" script='train.py', \n",
" run_config=conda_run_config,\n",
" arguments=['--data-folder', str(ds.as_mount())] \n",
" )\n",
"\n",
"run = exp.submit(config=src)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show running details."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9b27ffe32d6c40d5bc887966340cd5d0",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"_UserRun()"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from azureml.train.widgets import RunDetails\n",
"RunDetails(run).show()"
]
},
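{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before querying metrics, we can block until the run completes; `wait_for_completion` streams the driver log to the notebook when `show_output=True` (the widget above updates asynchronously either way):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# block until the remote run finishes, streaming logs to stdout\n",
"run.wait_for_completion(show_output=True)"
]
},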
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Display run results"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'Data path': 'workspacefilestore/nyc-taxi/data_after_prep.pkl', 'Data loading': 'finished', 'Data processing': 'finished', 'Normalization': 'finished', 'Model training': 'finished', 'Prediction': 'finished', 'Evaluation': 'finished', 'Mean Squared Error': 0.3878969301600042}\n"
]
}
],
"source": [
"print(run.get_metrics())"
]
},
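{
"cell_type": "markdown",
"metadata": {},
"source": [
"The metrics come back as a plain dictionary keyed by the names passed to `run.log`, so individual values are easy to pull out:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# pull out a single logged metric (the key matches the name used in run.log)\n",
"metrics = run.get_metrics()\n",
"print('MSE:', metrics['Mean Squared Error'])"
]
},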
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register model\n",
"\n",
"`outputs` is a special directory in that all content in this directory is automatically uploaded to your workspace. Hence, the model file will also available in the workspace.\n",
"\n",
"We can see files associated with that run with the following line."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['azureml-logs/60_control_log.txt', 'azureml-logs/80_driver_log.txt', 'outputs/nyc_taxi_model.pkl', 'driver_log', 'azureml-logs/azureml.log']\n"
]
}
],
"source": [
"print(run.get_file_names())"
]
},
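{
"cell_type": "markdown",
"metadata": {},
"source": [
"Any of these files can be fetched locally with `run.download_file` (a sketch; the local file name is our choice):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# download the trained model artifact from the run record\n",
"run.download_file(name='outputs/nyc_taxi_model.pkl', output_file_path='./nyc_taxi_model.pkl')"
]
},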
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Register the model in the workspace so that we can later query, examine, and deploy this model."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"nyc_taxi_model\tnyc_taxi_model:2\t2\n"
]
}
],
"source": [
"# register model \n",
"model = run.register_model(model_name='nyc_taxi_model', model_path='outputs/nyc_taxi_model.pkl')\n",
"print(model.name, model.id, model.version, sep = '\\t')"
]
},
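{
"cell_type": "markdown",
"metadata": {},
"source": [
"Later, for example from a scoring script, the registered model can be retrieved by name from the workspace and downloaded. A minimal sketch (`registered` is a hypothetical name):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core.model import Model\n",
"\n",
"# fetch the latest registered version by name and download it locally\n",
"registered = Model(ws, name='nyc_taxi_model')\n",
"registered.download(target_dir='.', exist_ok=True)"
]
},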
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean up the compute target"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dsvm_compute.delete()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:anaconda3]",
"language": "python",
"name": "conda-env-anaconda3-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}