@feaselkl
Created March 19, 2021 22:39
Read from Azure SQL Database in Azure Machine Learning and subsequently write back to Azure SQL Database
{
"cells": [
{
"cell_type": "markdown",
"source": [
"## Reading from and Writing to Azure SQL Database using Azure Machine Learning\r\n",
"\r\n",
"The purpose of this notebook is to demonstrate how to read from and write to Azure Machine Learning using the pipeline approach. This assumes the following:\r\n",
"* You have configured a Datastore named `expense_reports`.\r\n",
"* You want to use the `workspaceblobstore` Azure Blob Storage Datastore. You can, of course, create your own and substitute it.\r\n",
"* You have configured an Azure Data Factory.\r\n",
"\r\n",
"First up, let's load some important Python libraries and obtain the workspace from our Azure ML config."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"import azureml.core\r\n",
"from azureml.core import Workspace, Datastore, Dataset\r\n",
"from azureml.data.datapath import DataPath\r\n",
"from azureml.data import DataType\r\n",
"from azureml.pipeline.steps import DataTransferStep\r\n",
"import pandas as pd\r\n",
"\r\n",
"ws = Workspace.from_config()"
],
"outputs": [],
"execution_count": 21,
"metadata": {
"gather": {
"logged": 1616188113540
}
}
},
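{
"cell_type": "markdown",
"source": [
"If the `expense_reports` Datastore hasn't been registered yet, here's a rough sketch of one way to do it against Azure SQL Database using a service principal. The server, database, and credential values below are placeholders rather than anything from this notebook; `workspaceblobstore` is created along with the workspace, so it needs no registration.\r\n",
"\r\n",
"```python\r\n",
"# Hypothetical registration sketch -- substitute your own server, database, and service principal.\r\n",
"sql_datastore = Datastore.register_azure_sql_database(\r\n",
"    workspace=ws,\r\n",
"    datastore_name=\"expense_reports\",\r\n",
"    server_name=\"<sql-server-name>\",  # server name only, without .database.windows.net\r\n",
"    database_name=\"<database-name>\",\r\n",
"    tenant_id=\"<tenant-id>\",\r\n",
"    client_id=\"<service-principal-client-id>\",\r\n",
"    client_secret=\"<service-principal-secret>\"\r\n",
")\r\n",
"```"
],
"metadata": {}
},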
{
"cell_type": "markdown",
"source": [
"Retrieve the expenses and Blob Storage datastores from Azure Machine Learning."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"expenses_datastore = Datastore.get(ws, datastore_name=\"expense_reports\")\r\n",
"blob_datastore = Datastore.get(ws, datastore_name=\"workspaceblobstore\")"
],
"outputs": [],
"execution_count": 2,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616185616652
}
}
},
{
"cell_type": "markdown",
"source": [
"Query our data. This will hit the `expenses_datastore` Datastore and will pull back data prior to 2017. Note that we need to specify the data types for each input in `from_sql_query`, but the resulting output of this is a Pandas DataFrame, making it easy to work with."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"query = DataPath(expenses_datastore, \"\"\"\r\n",
"SELECT\r\n",
" er.EmployeeID,\r\n",
" CONCAT(e.FirstName, ' ', e.LastName) AS EmployeeName,\r\n",
" ec.ExpenseCategoryID,\r\n",
" ec.ExpenseCategory,\r\n",
" er.ExpenseDate,\r\n",
" YEAR(er.ExpenseDate) AS ExpenseYear,\r\n",
" -- Python requires FLOAT values--it does not support DECIMAL\r\n",
" CAST(er.Amount AS FLOAT) AS Amount\r\n",
"FROM dbo.ExpenseReport er\r\n",
" INNER JOIN dbo.ExpenseCategory ec\r\n",
" ON er.ExpenseCategoryID = ec.ExpenseCategoryID\r\n",
" INNER JOIN dbo.Employee e\r\n",
" ON e.EmployeeID = er.EmployeeID\r\n",
"WHERE\r\n",
" er.ExpenseDate < '2017-01-01'\"\"\")\r\n",
"\r\n",
"data_types = {\r\n",
" 'EmployeeID': DataType.to_long(),\r\n",
" 'EmployeeName': DataType.to_string(),\r\n",
" 'ExpenseCategoryID': DataType.to_long(),\r\n",
" 'ExpenseCategory': DataType.to_string(),\r\n",
" 'ExpenseDate': DataType.to_datetime('%Y-%m-%d'),\r\n",
" 'ExpenseYear': DataType.to_long(),\r\n",
" 'Amount': DataType.to_float()\r\n",
"}\r\n",
"\r\n",
"expense_reports = Dataset.Tabular.from_sql_query(query, set_column_types=data_types).to_pandas_dataframe()"
],
"outputs": [],
"execution_count": 4,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616185778933
}
}
},
{
"cell_type": "markdown",
"source": [
"This gives you a feel for what the data looks like. It's fairly straightforward, tabular data. Almost like I generated it for a demo or something."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"expense_reports.head()"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 5,
"data": {
"text/plain": " EmployeeID EmployeeName ExpenseCategoryID ExpenseCategory ExpenseDate \\\n0 1 Jack Aubrey 1 Inexpensive City 2011-01-28 \n1 1 Jack Aubrey 1 Inexpensive City 2011-01-31 \n2 1 Jack Aubrey 1 Inexpensive City 2011-02-10 \n3 1 Jack Aubrey 1 Inexpensive City 2011-02-11 \n4 1 Jack Aubrey 1 Inexpensive City 2011-02-28 \n\n ExpenseYear Amount \n0 2011 22.46 \n1 2011 23.43 \n2 2011 35.40 \n3 2011 31.39 \n4 2011 20.69 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>EmployeeID</th>\n <th>EmployeeName</th>\n <th>ExpenseCategoryID</th>\n <th>ExpenseCategory</th>\n <th>ExpenseDate</th>\n <th>ExpenseYear</th>\n <th>Amount</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Jack Aubrey</td>\n <td>1</td>\n <td>Inexpensive City</td>\n <td>2011-01-28</td>\n <td>2011</td>\n <td>22.46</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>Jack Aubrey</td>\n <td>1</td>\n <td>Inexpensive City</td>\n <td>2011-01-31</td>\n <td>2011</td>\n <td>23.43</td>\n </tr>\n <tr>\n <th>2</th>\n <td>1</td>\n <td>Jack Aubrey</td>\n <td>1</td>\n <td>Inexpensive City</td>\n <td>2011-02-10</td>\n <td>2011</td>\n <td>35.40</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1</td>\n <td>Jack Aubrey</td>\n <td>1</td>\n <td>Inexpensive City</td>\n <td>2011-02-11</td>\n <td>2011</td>\n <td>31.39</td>\n </tr>\n <tr>\n <th>4</th>\n <td>1</td>\n <td>Jack Aubrey</td>\n <td>1</td>\n <td>Inexpensive City</td>\n <td>2011-02-28</td>\n <td>2011</td>\n <td>20.69</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
],
"execution_count": 5,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187280826
}
}
},
{
"cell_type": "markdown",
"source": [
"Load the `RandomForestRegressor` from scikit-learn and fit for `Amount` given `ExpenseCategoryID` and `ExpenseYear`. It's a simple model but it works pretty well."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"from sklearn.ensemble import RandomForestRegressor\r\n",
"reg = RandomForestRegressor() \r\n",
"model = reg.fit(expense_reports[[\"ExpenseCategoryID\", \"ExpenseYear\"]], expense_reports[[\"Amount\"]].values.ravel())"
],
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"/anaconda/envs/azureml_py36/lib/python3.6/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n",
" \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
]
}
],
"execution_count": 6,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187285275
}
}
},
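{
"cell_type": "markdown",
"source": [
"As a quick, optional sanity check on \"works pretty well\", here's a hedged sketch that holds out a slice of the pre-2017 data and reports R^2 on it. This wasn't part of the original run, and `test_size` and `random_state` are arbitrary choices.\r\n",
"\r\n",
"```python\r\n",
"from sklearn.model_selection import train_test_split\r\n",
"\r\n",
"# Hold out 20% of the pre-2017 rows and score a fresh model on them.\r\n",
"X = expense_reports[[\"ExpenseCategoryID\", \"ExpenseYear\"]]\r\n",
"y = expense_reports[\"Amount\"]\r\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\r\n",
"\r\n",
"check_model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)\r\n",
"print(f\"Holdout R^2: {check_model.score(X_test, y_test):.3f}\")\r\n",
"```"
],
"metadata": {}
},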
{
"cell_type": "markdown",
"source": [
"Here's the data that we'll use for predictions. This is expense reports from 2017 forward."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"query = DataPath(expenses_datastore, \"\"\"\r\n",
"SELECT\r\n",
" er.EmployeeID,\r\n",
" CONCAT(e.FirstName, ' ', e.LastName) AS EmployeeName,\r\n",
" ec.ExpenseCategoryID,\r\n",
" ec.ExpenseCategory,\r\n",
" er.ExpenseDate,\r\n",
" YEAR(er.ExpenseDate) AS ExpenseYear,\r\n",
" -- Python requires FLOAT values--it does not support DECIMAL\r\n",
" CAST(er.Amount AS FLOAT) AS Amount\r\n",
"FROM dbo.ExpenseReport er\r\n",
" INNER JOIN dbo.ExpenseCategory ec\r\n",
" ON er.ExpenseCategoryID = ec.ExpenseCategoryID\r\n",
" INNER JOIN dbo.Employee e\r\n",
" ON e.EmployeeID = er.EmployeeID\r\n",
"WHERE\r\n",
" er.ExpenseDate >= '2017-01-01'\"\"\")\r\n",
"\r\n",
"data_types = {\r\n",
" 'EmployeeID': DataType.to_long(),\r\n",
" 'EmployeeName': DataType.to_string(),\r\n",
" 'ExpenseCategoryID': DataType.to_long(),\r\n",
" 'ExpenseCategory': DataType.to_string(),\r\n",
" 'ExpenseDate': DataType.to_datetime('%Y-%m-%d'),\r\n",
" 'ExpenseYear': DataType.to_long(),\r\n",
" 'Amount': DataType.to_float()\r\n",
"}\r\n",
"\r\n",
"expense_reports_to_predict = Dataset.Tabular.from_sql_query(query, set_column_types=data_types).to_pandas_dataframe()"
],
"outputs": [],
"execution_count": 9,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187313008
}
}
},
{
"cell_type": "markdown",
"source": [
"Now that we have our data, generate predictions. Then, concatenate the `PredictedAmount` column onto the expense reports DataFrame so that we can see the inputs as well as the prediction."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"pred = pd.DataFrame({\"PredictedAmount\" : model.predict(expense_reports_to_predict[[\"ExpenseCategoryID\", \"ExpenseYear\"]]) })\r\n",
"output_data_set = pd.concat([expense_reports_to_predict, pred], axis=1)"
],
"outputs": [],
"execution_count": 10,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187315493
}
}
},
{
"cell_type": "markdown",
"source": [
"Here's a brief view of the resulting outputs."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"output_data_set.head()"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 11,
"data": {
"text/plain": " EmployeeID EmployeeName ExpenseCategoryID ExpenseCategory ExpenseDate \\\n0 1 Jack Aubrey 1 Inexpensive City 2017-01-03 \n1 1 Jack Aubrey 1 Inexpensive City 2017-01-18 \n2 1 Jack Aubrey 1 Inexpensive City 2017-01-20 \n3 1 Jack Aubrey 1 Inexpensive City 2017-01-24 \n4 1 Jack Aubrey 1 Inexpensive City 2017-02-07 \n\n ExpenseYear Amount PredictedAmount \n0 2017 24.67 23.948642 \n1 2017 26.65 23.948642 \n2 2017 29.39 23.948642 \n3 2017 28.06 23.948642 \n4 2017 22.72 23.948642 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>EmployeeID</th>\n <th>EmployeeName</th>\n <th>ExpenseCategoryID</th>\n <th>ExpenseCategory</th>\n <th>ExpenseDate</th>\n <th>ExpenseYear</th>\n <th>Amount</th>\n <th>PredictedAmount</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Jack Aubrey</td>\n <td>1</td>\n <td>Inexpensive City</td>\n <td>2017-01-03</td>\n <td>2017</td>\n <td>24.67</td>\n <td>23.948642</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>Jack Aubrey</td>\n <td>1</td>\n <td>Inexpensive City</td>\n <td>2017-01-18</td>\n <td>2017</td>\n <td>26.65</td>\n <td>23.948642</td>\n </tr>\n <tr>\n <th>2</th>\n <td>1</td>\n <td>Jack Aubrey</td>\n <td>1</td>\n <td>Inexpensive City</td>\n <td>2017-01-20</td>\n <td>2017</td>\n <td>29.39</td>\n <td>23.948642</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1</td>\n <td>Jack Aubrey</td>\n <td>1</td>\n <td>Inexpensive City</td>\n <td>2017-01-24</td>\n <td>2017</td>\n <td>28.06</td>\n <td>23.948642</td>\n </tr>\n <tr>\n <th>4</th>\n <td>1</td>\n <td>Jack Aubrey</td>\n <td>1</td>\n <td>Inexpensive City</td>\n <td>2017-02-07</td>\n <td>2017</td>\n <td>22.72</td>\n <td>23.948642</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
],
"execution_count": 11,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187317566
}
}
},
{
"cell_type": "markdown",
"source": [
"Now I want to write the results to Azure SQL Database. The thing is, though, that there's no direct way to perform that write. The best available option (as of March of 2021) is to write the data to Azure Blob Storage and then transfer that data to Azure SQL Database.\r\n",
"\r\n",
"Well, to write the data to Azure Blob Storage, I first need to write it locally and then transfer. I'll call the output `predictions.csv`."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"import os\r\n",
"if not os.path.exists('data'):\r\n",
" os.mkdir('data')\r\n",
"local_path = 'data/predictions.csv'\r\n",
"output_data_set.to_csv(local_path, index=False)"
],
"outputs": [],
"execution_count": 12,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187331040
}
}
},
{
"cell_type": "markdown",
"source": [
"Now that the data is saved locally, we can upload it to Azure Blob Storage. This is where I first use the `blob_datastore` Datastore. I'm going to write it to `ExpenseReportPrediction/predictions.csv`. If you go digging into the storage account Azure Machine Learning uses, you can find this folder."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"blob_datastore.upload(src_dir='data', target_path='ExpenseReportPrediction')"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Uploading an estimated of 1 files\n",
"Uploading data/predictions.csv\n",
"Uploaded data/predictions.csv, 1 files out of an estimated total of 1\n",
"Uploaded 1 files\n"
]
},
{
"output_type": "execute_result",
"execution_count": 13,
"data": {
"text/plain": "$AZUREML_DATAREFERENCE_b0d75737e2fa415f8c4ecca3046804e2"
},
"metadata": {}
}
],
"execution_count": 13,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187333563
}
}
},
{
"cell_type": "markdown",
"source": [
"Now we want to load a few more objects in order to build out a pipeline. Technically, we could have made training and inference steps in this pipeline as well, and that's what I'd do on a production project."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"from azureml.core import Workspace, Experiment\r\n",
"from azureml.pipeline.core import Pipeline\r\n",
"from azureml.data.data_reference import DataReference\r\n",
"from azureml.pipeline.steps import DataTransferStep\r\n",
"from azureml.data.sql_data_reference import SqlDataReference\r\n",
"from azureml.core.compute import ComputeTarget, DataFactoryCompute "
],
"outputs": [],
"execution_count": 14,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187338800
}
}
},
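{
"cell_type": "markdown",
"source": [
"As an aside on the production remark above: training and scoring could also live inside the pipeline, for example as a `PythonScriptStep` that runs ahead of the transfer step. The script name, source directory, and compute target below are hypothetical and aren't defined anywhere in this notebook -- this only sketches the shape such a step would take.\r\n",
"\r\n",
"```python\r\n",
"from azureml.pipeline.steps import PythonScriptStep\r\n",
"\r\n",
"# Hypothetical training/scoring step; the script and compute target are placeholders.\r\n",
"train_and_score_step = PythonScriptStep(\r\n",
"    name=\"train_and_score\",\r\n",
"    script_name=\"train_and_score.py\",\r\n",
"    source_directory=\"./scripts\",\r\n",
"    compute_target=\"cpu-cluster\",\r\n",
"    allow_reuse=False\r\n",
")\r\n",
"```"
],
"metadata": {}
},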
{
"cell_type": "markdown",
"source": [
"We need to bring in Azure Data Factory. To do that, we specify the location of the data factory, both its resource group (`rg`) and the Data Factory name (`adf`)."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"rg = '<Resource Group>'\r\n",
"adf='<Data Factory>'\r\n",
"adfcompute = 'amlcompute-adf'\r\n",
"\r\n",
"adfconfig = DataFactoryCompute.attach_configuration(resource_group=rg, factory_name=adf, resource_id=None)\r\n",
"adf_compute = ComputeTarget.attach(workspace=ws, name=adfcompute, attach_configuration=adfconfig)\r\n",
"adf_compute.wait_for_completion()"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Provisioning operation finished, operation \"Succeeded\"\n"
]
}
],
"execution_count": 15,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187540267
}
}
},
{
"cell_type": "markdown",
"source": [
"Create a `DataReference()` reference to bring in Azure Blob Storage. We'll read all of the files from `ExpenseReportPrediction/`."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"prediction_blob_ref = DataReference(\r\n",
" datastore=blob_datastore,\r\n",
" data_reference_name=\"prediction_blob_ref\",\r\n",
" path_on_datastore=\"ExpenseReportPrediction/\",\r\n",
" mode=\"mount\",\r\n",
" path_on_compute=None,\r\n",
" overwrite=False\r\n",
")"
],
"outputs": [],
"execution_count": 16,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187565403
}
}
},
{
"cell_type": "markdown",
"source": [
"Bring in Azure SQL Database as a `SqlDataReference`. We will write out to the `ExpenseReportPrediction` table in SQL Server. Note that this table must already exist prior to executing the pipeline!"
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
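{
"cell_type": "markdown",
"source": [
"The notebook doesn't include the DDL, but a target table shaped like the CSV we uploaded might look roughly like the sketch below. The column names mirror `output_data_set`; the exact SQL types, and the use of `pyodbc` (which appears nowhere else in this notebook), are assumptions.\r\n",
"\r\n",
"```python\r\n",
"import pyodbc\r\n",
"\r\n",
"# Hypothetical one-off table creation; connection details are placeholders.\r\n",
"ddl = \"\"\"\r\n",
"CREATE TABLE dbo.ExpenseReportPrediction (\r\n",
"    EmployeeID INT NOT NULL,\r\n",
"    EmployeeName NVARCHAR(100) NOT NULL,\r\n",
"    ExpenseCategoryID INT NOT NULL,\r\n",
"    ExpenseCategory NVARCHAR(100) NOT NULL,\r\n",
"    ExpenseDate DATE NOT NULL,\r\n",
"    ExpenseYear INT NOT NULL,\r\n",
"    Amount FLOAT NOT NULL,\r\n",
"    PredictedAmount FLOAT NOT NULL\r\n",
")\"\"\"\r\n",
"\r\n",
"with pyodbc.connect(\"Driver={ODBC Driver 17 for SQL Server};Server=<server>.database.windows.net;Database=<database>;Uid=<user>;Pwd=<password>\") as conn:\r\n",
"    conn.execute(ddl)\r\n",
"    conn.commit()\r\n",
"```"
],
"metadata": {}
},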
{
"cell_type": "code",
"source": [
"prediction_sql_ref = SqlDataReference(\r\n",
" datastore=expenses_datastore,\r\n",
" data_reference_name=\"prediction_sql_ref\",\r\n",
" sql_table=\"ExpenseReportPrediction\",\r\n",
" sql_query=None,\r\n",
" sql_stored_procedure=None,\r\n",
" sql_stored_procedure_params=None\r\n",
")"
],
"outputs": [],
"execution_count": 17,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187567951
}
}
},
{
"cell_type": "markdown",
"source": [
"This `DataTransferStep` migrates our data from Azure Blob Storage into Azure SQL Database. We set `allow_reuse=False` here because that allows us to re-run the operation with new data (but the same code) and actually get results. if `allow_reuse=True`, re-running this will return a completed status but not do anything new after the first time it runs."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"transfer_blob_to_sql = DataTransferStep(\r\n",
" name=\"transfer_blob_to_sql\",\r\n",
" source_data_reference=prediction_blob_ref,\r\n",
" destination_data_reference=prediction_sql_ref,\r\n",
" compute_target=adf_compute,\r\n",
" allow_reuse=False,\r\n",
" destination_reference_type=None\r\n",
")"
],
"outputs": [],
"execution_count": 18,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187570661
}
}
},
{
"cell_type": "markdown",
"source": [
"And here's the `Pipeline` which will do the work. It has one step: `transfer_blob_to_sql`."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"datatransfer_pipeline = Pipeline(workspace=ws, \r\n",
" steps=[transfer_blob_to_sql], \r\n",
" description='Transfer blob data to sql')"
],
"outputs": [],
"execution_count": 19,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187574016
}
}
},
{
"cell_type": "markdown",
"source": [
"We execute the pipeline in the context of an `Experiment`. It takes a little while to execute and gives us a summary of what happened. The net result here is that we inserted the data we wanted into `dbo.ExpenseReportPrediction`."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"exp = Experiment(workspace = ws, name=\"DataTransfer_BlobtoSQL\")\r\n",
"\r\n",
"exp_pipelinerun = exp.submit(datatransfer_pipeline)\r\n",
"\r\n",
"exp_pipelinerun.wait_for_completion()"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Created step transfer_blob_to_sql [bfc1f34a][55df7c95-5290-47a2-9ceb-1a5522616913], (This step will run and generate new outputs)\n",
"Created data reference prediction_blob_ref for StepId [c86ee31c][848113d4-2aee-40ed-9591-238209e8c130], (Consumers of this data will generate new runs.)\n",
"Created data reference prediction_sql_ref for StepId [62d461df][77842e16-2884-47e3-a38e-eaa9840260e8], (Consumers of this data will generate new runs.)\n",
"Submitted PipelineRun <RunID>\n",
"Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/DataTransfer_BlobtoSQL/runs/<RunID>?wsid=/subscriptions/<Subscription>/resourcegroups/<ResourceGroup>/workspaces/<ResourceGroup>\n",
"PipelineRunId: <RunID>\n",
"Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/DataTransfer_BlobtoSQL/runs/<RunID>?wsid=/subscriptions/<Subscription>/resourcegroups/<ResourceGroup>/workspaces/<ResourceGroup>\n",
"PipelineRun Status: NotStarted\n",
"PipelineRun Status: Running\n",
"\n",
"\n",
"StepRunId: <StepRunID>\n",
"Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/DataTransfer_BlobtoSQL/runs/<StepRunID>?wsid=/subscriptions/<Subscription>/resourcegroups/<ResourceGroup>/workspaces/<ResourceGroup>\n",
"StepRun( transfer_blob_to_sql ) Status: NotStarted\n",
"StepRun( transfer_blob_to_sql ) Status: Running\n",
"\n",
"StepRun(transfer_blob_to_sql) Execution Summary\n",
"================================================\n",
"StepRun( transfer_blob_to_sql ) Status: Finished\n",
"{'runId': '<StepRunID>', 'status': 'Completed', 'startTimeUtc': '2021-03-19T21:00:10.953527Z', 'endTimeUtc': '2021-03-19T21:01:58.049439Z', 'properties': {'ContentSnapshotId': 'e2c1f050-15b2-403a-991a-158390494b75', 'StepType': 'DataTransferStep', 'ComputeTargetType': 'ADF', 'azureml.moduleid': '55df7c95-5290-47a2-9ceb-1a5522616913', 'azureml.runsource': 'azureml.StepRun', 'azureml.nodeid': 'bfc1f34a', 'azureml.pipelinerunid': '<RunID>'}, 'inputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://<ResourceGroup><BlobStorageURI>/azureml/ExperimentRun/dcid.<StepRunID>/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=ESIsK%2BxnixB9sfsrluq%2Fx348NKA%2FUPYsuacZaVA9qds%3D&st=2021-03-19T20%3A50%3A00Z&se=2021-03-20T05%3A00%3A00Z&sp=r', 'logs/azureml/stderrlogs.txt': 'https://<ResourceGroup><BlobStorageURI>/azureml/ExperimentRun/dcid.<StepRunID>/logs/azureml/stderrlogs.txt?sv=2019-02-02&sr=b&sig=BZuYKpZFJegKiFmtaNIzbxkkTY8qVli27ChtuXLHpLU%3D&st=2021-03-19T20%3A50%3A00Z&se=2021-03-20T05%3A00%3A00Z&sp=r', 'logs/azureml/stdoutlogs.txt': 'https://<ResourceGroup><BlobStorageURI>/azureml/ExperimentRun/dcid.<StepRunID>/logs/azureml/stdoutlogs.txt?sv=2019-02-02&sr=b&sig=8F0XE3fwXqu%2FnYS5rap%2FePCdQBtVNoeO6kau4yxtg4Q%3D&st=2021-03-19T20%3A50%3A00Z&se=2021-03-20T05%3A00%3A00Z&sp=r'}}\n",
"\n",
"\n",
"\n",
"PipelineRun Execution Summary\n",
"==============================\n",
"PipelineRun Status: Finished\n",
"{'runId': '<RunID>', 'status': 'Completed', 'startTimeUtc': '2021-03-19T20:59:49.920186Z', 'endTimeUtc': '2021-03-19T21:02:28.476176Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}'}, 'inputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://<ResourceGroup><BlobStorageURI>/azureml/ExperimentRun/dcid.<RunID>/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=5QBzCQaYCMzinf2Vl8PB3QfGE09aKFBxY0vK5vvKKZ0%3D&st=2021-03-19T20%3A52%3A30Z&se=2021-03-20T05%3A02%3A30Z&sp=r', 'logs/azureml/stderrlogs.txt': 'https://<ResourceGroup><BlobStorageURI>/azureml/ExperimentRun/dcid.<RunID>/logs/azureml/stderrlogs.txt?sv=2019-02-02&sr=b&sig=W2EFozN%2B7TjOFFYE%2F%2BQnM%2FXeuah5PjJ3CZlJAFIZ9j4%3D&st=2021-03-19T20%3A52%3A30Z&se=2021-03-20T05%3A02%3A30Z&sp=r', 'logs/azureml/stdoutlogs.txt': 'https://<ResourceGroup><BlobStorageURI>/azureml/ExperimentRun/dcid.<RunID>/logs/azureml/stdoutlogs.txt?sv=2019-02-02&sr=b&sig=5UH2PkoUrzAi7FkQRMyUYMXrBpFe8FRrDp8O6x41s1Y%3D&st=2021-03-19T20%3A52%3A30Z&se=2021-03-20T05%3A02%3A30Z&sp=r'}}\n",
"\n"
]
},
{
"output_type": "execute_result",
"execution_count": 20,
"data": {
"text/plain": "'Finished'"
},
"metadata": {}
}
],
"execution_count": 20,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"gather": {
"logged": 1616187750411
}
}
}
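,
{
"cell_type": "markdown",
"source": [
"Once the pipeline reports `Finished`, one way to confirm the rows actually landed is to read the target table back through the same datastore. This wasn't part of the original run; it's just a verification sketch reusing the objects defined above.\r\n",
"\r\n",
"```python\r\n",
"# Read a sample of the freshly loaded predictions back out of Azure SQL Database.\r\n",
"verify_query = DataPath(expenses_datastore, \"SELECT TOP(10) * FROM dbo.ExpenseReportPrediction\")\r\n",
"verification = Dataset.Tabular.from_sql_query(verify_query).to_pandas_dataframe()\r\n",
"verification\r\n",
"```"
],
"metadata": {}
}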
],
"metadata": {
"kernelspec": {
"name": "python3-azureml",
"language": "python",
"display_name": "Python 3.6 - AzureML"
},
"language_info": {
"name": "python",
"version": "3.6.9",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"kernel_info": {
"name": "python3-azureml"
},
"microsoft": {
"host": {
"AzureML": {
"notebookHasBeenCompleted": true
}
}
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}