tdhopper/.dockerignore

## .dockerignore
data/
.env/
__pycache__
.ipynb_checkpoints

## .gitignore
Training Report.pdf
*.tar.gz
handler_venv/
*.nbconvert.*
static/
env/
*.pkl
.env
.ipynb_checkpoints
**/__pycache__

## .python-version
3.8.0

## README.md

      
    Energy Efficiency Predictor

Problem


The goal of this challenge is to build a regression model and deploy it with Docker. The dataset for the challenge is available at https://archive.ics.uci.edu/ml/datasets/Energy+efficiency. You should be able to run the Docker image and then curl the container, sending JSON that contains the attributes of a new building, and get back a JSON response with the heating and cooling loads predicted by your trained model. The code should be written in Python, but you can use whichever libraries you like to train and deploy the model.

Solution

I built a simple multivariate (two-output) Lasso model with scikit-learn that is served with Flask.
You can see the notebook used to train the model at https://9whioydhmb.execute-api.us-east-1.amazonaws.com/carbonrelay.
The model can be tested by posting a JSON object of input values to an AWS endpoint, e.g.:
curl -d '{
    "relative_compactness": 0.98,
    "surface_area": 514.5,
    "wall_area": 294.0,
    "roof_area": 110.25,
    "overall_height": 7.0,
    "orientation": 2.0,
    "glazing_area": 0.0,
    "glazing_area_distribution": 0.0
}' -H 'Content-Type: application/json' https://9whioydhmb.execute-api.us-east-1.amazonaws.com/carbonrelay/predict
The HTTP response is JSON containing a heating_load field and a cooling_load field.
Building and Testing

Prerequisites


Running Docker client

Steps

Run `docker-compose up` to train the model and start the web server on port 5000. Once this completes, you
should be able to run:
curl -d '{
    "relative_compactness": 0.98,
    "surface_area": 514.5,
    "wall_area": 294.0,
    "roof_area": 110.25,
    "overall_height": 7.0,
    "orientation": 2.0,
    "glazing_area": 0.0,
    "glazing_area_distribution": 0.0
}' -H 'Content-Type: application/json' http://127.0.0.1:5000/predict
You can also view the Jupyter notebook with training information and model performance information at http://127.0.0.1:5000/.
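
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started by `docker-compose up` is listening on 127.0.0.1:5000. The helper names `build_predict_request` and `predict` are illustrative and not part of this repository.

```python
import json
import urllib.request

# Feature values copied from the curl example above.
EXAMPLE_BUILDING = {
    "relative_compactness": 0.98,
    "surface_area": 514.5,
    "wall_area": 294.0,
    "roof_area": 110.25,
    "overall_height": 7.0,
    "orientation": 2.0,
    "glazing_area": 0.0,
    "glazing_area_distribution": 0.0,
}


def build_predict_request(features, base_url="http://127.0.0.1:5000"):
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        base_url + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features, base_url="http://127.0.0.1:5000"):
    """Send the request and decode the JSON response into a dict."""
    request = build_predict_request(features, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `predict(EXAMPLE_BUILDING)` should return a dictionary with `heating_load` and `cooling_load` keys.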

  
## app.py
import joblib
import pandas as pd

from flask import Flask, request

app = Flask("energy_efficiency")


@app.route("/")
def report():
    return app.send_static_file("Train.html")


@app.route("/predict", methods=["POST"])
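def predict():
    """Predict heating and cooling loads for one building posted as JSON."""
    # Build a one-row DataFrame whose columns are the posted feature names.
    data = pd.DataFrame.from_records([request.get_json()])
    # Loading on each request means a model.pkl written by the train
    # service after the server started is still picked up.
    pipe = joblib.load("model.pkl")
    output = pipe.predict(data)
    # One input row and two targets: (heating_load, cooling_load).
    assert output.shape == (1, 2)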
    return {
        "heating_load": output[0][0],
        "cooling_load": output[0][1],
    }
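
The handler above passes the posted JSON straight to the model, so it relies on the client sending exactly the eight feature names used in training. The following is a minimal sketch of a validation step that could run before prediction, assuming it is acceptable to reject malformed input. `validate_payload` and `FEATURE_COLUMNS` are illustrative names, not part of the original app.

```python
# Feature names in the order used by the training notebook.
FEATURE_COLUMNS = [
    "relative_compactness",
    "surface_area",
    "wall_area",
    "roof_area",
    "overall_height",
    "orientation",
    "glazing_area",
    "glazing_area_distribution",
]


def validate_payload(payload):
    """Return the feature values as floats, keyed in training-column order.

    Raises ValueError if the payload is not a JSON object, if a feature is
    missing or unexpected, or if a value is not a number.
    """
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in FEATURE_COLUMNS if name not in payload]
    extra = [name for name in payload if name not in FEATURE_COLUMNS]
    if missing or extra:
        raise ValueError(f"missing fields: {missing}; unexpected fields: {extra}")
    row = {}
    for name in FEATURE_COLUMNS:
        value = payload[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number")
        row[name] = float(value)
    return row
```

Inside `predict`, the returned dictionary could be passed to `pd.DataFrame.from_records([row])`, and a `ValueError` could be turned into a 400 response instead of an unhandled server error.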


## docker-compose.yml
version: '3'
services:
  web-server:
    build: .
    entrypoint: flask run -h 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - .:/home
  train:
    build: .
    entrypoint: /bin/sh
    command: /home/train.sh
    volumes:
      - .:/home

## Dockerfile
FROM fnndsc/ubuntu-python3
RUN apt-get update -qq && apt-get install --yes -qq build-essential
RUN pip install --quiet --upgrade pip
WORKDIR /home
COPY . ./
RUN pip install -r requirements.txt
ENV FLASK_APP=app
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

## ENB2012_data.xlsx

      
        
    
## requirements.txt
scikit-learn
flask
jupyter
pandas
xlrd
# xlrd 2.0 and later reads only .xls files; pandas uses openpyxl for .xlsx.
openpyxl

## Train.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train dummy model and linear regression model on Energy Efficiency dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import joblib\n",
    "\n",
    "from sklearn.linear_model import Lasso\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.dummy import DummyRegressor\n",
    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
    "from sklearn.compose import ColumnTransformer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_columns = [\n",
    "    \"relative_compactness\",\n",
    "    \"surface_area\",\n",
    "    \"wall_area\",\n",
    "    \"roof_area\",\n",
    "    \"overall_height\",\n",
    "    \"orientation\",\n",
    "    \"glazing_area\",\n",
    "    \"glazing_area_distribution\",\n",
    "]\n",
    "\n",
    "output_columns = [\n",
    "    \"heating_load\",\n",
    "    \"cooling_load\",\n",
    "]\n",
    "\n",
    "data = pd.read_excel(\"ENB2012_data.xlsx\", \n",
    "                     names=feature_columns + output_columns,\n",
    "                    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>relative_compactness</th>\n",
       "      <th>surface_area</th>\n",
       "      <th>wall_area</th>\n",
       "      <th>roof_area</th>\n",
       "      <th>overall_height</th>\n",
       "      <th>orientation</th>\n",
       "      <th>glazing_area</th>\n",
       "      <th>glazing_area_distribution</th>\n",
       "      <th>heating_load</th>\n",
       "      <th>cooling_load</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.00000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.00000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.764167</td>\n",
       "      <td>671.708333</td>\n",
       "      <td>318.500000</td>\n",
       "      <td>176.604167</td>\n",
       "      <td>5.25000</td>\n",
       "      <td>3.500000</td>\n",
       "      <td>0.234375</td>\n",
       "      <td>2.81250</td>\n",
       "      <td>22.307195</td>\n",
       "      <td>24.587760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.105777</td>\n",
       "      <td>88.086116</td>\n",
       "      <td>43.626481</td>\n",
       "      <td>45.165950</td>\n",
       "      <td>1.75114</td>\n",
       "      <td>1.118763</td>\n",
       "      <td>0.133221</td>\n",
       "      <td>1.55096</td>\n",
       "      <td>10.090204</td>\n",
       "      <td>9.513306</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.620000</td>\n",
       "      <td>514.500000</td>\n",
       "      <td>245.000000</td>\n",
       "      <td>110.250000</td>\n",
       "      <td>3.50000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>6.010000</td>\n",
       "      <td>10.900000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.682500</td>\n",
       "      <td>606.375000</td>\n",
       "      <td>294.000000</td>\n",
       "      <td>140.875000</td>\n",
       "      <td>3.50000</td>\n",
       "      <td>2.750000</td>\n",
       "      <td>0.100000</td>\n",
       "      <td>1.75000</td>\n",
       "      <td>12.992500</td>\n",
       "      <td>15.620000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.750000</td>\n",
       "      <td>673.750000</td>\n",
       "      <td>318.500000</td>\n",
       "      <td>183.750000</td>\n",
       "      <td>5.25000</td>\n",
       "      <td>3.500000</td>\n",
       "      <td>0.250000</td>\n",
       "      <td>3.00000</td>\n",
       "      <td>18.950000</td>\n",
       "      <td>22.080000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>0.830000</td>\n",
       "      <td>741.125000</td>\n",
       "      <td>343.000000</td>\n",
       "      <td>220.500000</td>\n",
       "      <td>7.00000</td>\n",
       "      <td>4.250000</td>\n",
       "      <td>0.400000</td>\n",
       "      <td>4.00000</td>\n",
       "      <td>31.667500</td>\n",
       "      <td>33.132500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>0.980000</td>\n",
       "      <td>808.500000</td>\n",
       "      <td>416.500000</td>\n",
       "      <td>220.500000</td>\n",
       "      <td>7.00000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>0.400000</td>\n",
       "      <td>5.00000</td>\n",
       "      <td>43.100000</td>\n",
       "      <td>48.030000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       relative_compactness  surface_area   wall_area   roof_area  \\\n",
       "count            768.000000    768.000000  768.000000  768.000000   \n",
       "mean               0.764167    671.708333  318.500000  176.604167   \n",
       "std                0.105777     88.086116   43.626481   45.165950   \n",
       "min                0.620000    514.500000  245.000000  110.250000   \n",
       "25%                0.682500    606.375000  294.000000  140.875000   \n",
       "50%                0.750000    673.750000  318.500000  183.750000   \n",
       "75%                0.830000    741.125000  343.000000  220.500000   \n",
       "max                0.980000    808.500000  416.500000  220.500000   \n",
       "\n",
       "       overall_height  orientation  glazing_area  glazing_area_distribution  \\\n",
       "count       768.00000   768.000000    768.000000                  768.00000   \n",
       "mean          5.25000     3.500000      0.234375                    2.81250   \n",
       "std           1.75114     1.118763      0.133221                    1.55096   \n",
       "min           3.50000     2.000000      0.000000                    0.00000   \n",
       "25%           3.50000     2.750000      0.100000                    1.75000   \n",
       "50%           5.25000     3.500000      0.250000                    3.00000   \n",
       "75%           7.00000     4.250000      0.400000                    4.00000   \n",
       "max           7.00000     5.000000      0.400000                    5.00000   \n",
       "\n",
       "       heating_load  cooling_load  \n",
       "count    768.000000    768.000000  \n",
       "mean      22.307195     24.587760  \n",
       "std       10.090204      9.513306  \n",
       "min        6.010000     10.900000  \n",
       "25%       12.992500     15.620000  \n",
       "50%       18.950000     22.080000  \n",
       "75%       31.667500     33.132500  \n",
       "max       43.100000     48.030000  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check, use a dummy regressor (which predicts the mean output) as a baseline, with 10-fold cross-validation to estimate R^2. We would expect R^2 values around 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.12217272, -0.15209695, -0.04929655, -0.00102648, -0.03166066,\n",
       "       -0.06108127, -0.00106225, -0.07762155, -0.21376269, -0.00233177])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dummy = DummyRegressor()\n",
    "\n",
    "cross_val_score(dummy, \n",
    "                data[feature_columns], \n",
    "                data[output_columns], \n",
    "                cv=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It appears from above that `orientation` is a categorical variable while the others are continuous. \n",
    "\n",
    "I train a simple model: one-hot encode the orientation, standardize the features (remove the mean \n",
    "and scale to unit variance), then fit a Lasso regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "ct = ColumnTransformer(\n",
    "    [(\"orientation\", OneHotEncoder(), [\"orientation\"])], remainder=\"passthrough\"\n",
    ")\n",
    "\n",
    "pipe = Pipeline([(\"transform\", ct), (\"norm\", StandardScaler()), (\"regress\", Lasso())])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.6419667 , 0.88427361, 0.87404772, 0.85625164, 0.90037526,\n",
       "       0.85565515, 0.89379651, 0.84736809, 0.83821679, 0.901613  ])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The model R^2 value increases significantly over the dummy model. \n",
    "\n",
    "cross_val_score(pipe, data[feature_columns], data[output_columns], cv=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['model.pkl']"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Train model on full training set.\n",
    "\n",
    "pipe.fit(data[feature_columns], data[output_columns])\n",
    "joblib.dump(pipe, 'model.pkl')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
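
The slightly negative baseline scores in the notebook are expected. A mean predictor scores exactly 0 only on the data whose mean it learned, and any gap between the training-fold mean and the held-out-fold mean pushes R^2 below 0. The following is a minimal single-target sketch with made-up numbers (not taken from the dataset). The `r2_score` function here is a hand-written helper for illustration, not the scikit-learn function of the same name.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up single-target values, not taken from the dataset.
train_fold = [10.0, 20.0, 30.0]
held_out_fold = [15.0, 25.0, 35.0]

# The baseline "model" is just the mean of the training fold.
baseline = sum(train_fold) / len(train_fold)

in_sample = r2_score(train_fold, [baseline] * len(train_fold))
held_out = r2_score(held_out_fold, [baseline] * len(held_out_fold))

print(in_sample)  # 0.0
print(held_out)   # -0.375
```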

## train.sh
#!/bin/bash

jupyter nbconvert --to html --execute Train.ipynb --ExecutePreprocessor.timeout=600
mkdir -p static
mv Train.html static

## zappa_settings.json
{
    "carbonrelay": {
        "app_function": "app.app",
        "aws_region": "us-east-1",
        "project_name": "energy_efficiency",
        "runtime": "python3.8",
        "s3_bucket": "zappa-carbon-relay-takehome",
        "slim_handler": true
    }
}