CSV Time Series Dataset Curation and Standardization
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "time-series-dataset-curation.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Dataset curation for time series\n",
"\n",
"[![Open In Colab <](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ShawnHymel/377f33c96876c4c7201083ed4da3c76a/raw/4678b2b5b391d2e08b455394e6e1e4c669a797bf/time_series_dataset_curation.ipynb)\n",
"\n",
"In the paper \"Efficient BackProp\" [1], LeCun et al. shows that we can achieve a more accurate model (e.g. artificial neural network) in less time by standarizing (i.e. to a mean of 0 and unit variance) and decorrelating our input data.\n",
"\n",
"However, the process of standarization assumes that the data is normally distributed (i.e. Gaussian). If our data does not follow a Gaussian distribution, we should perform normalization [2], where we divide by the range to produce a set of values between 0 and 1.\n",
"\n",
"Create a directory */content/dataset* and upload your entire dataset there. Run through the cells in this notebook, following all of the directions to analyze the data and create a curated dataset. Note that we perform only standardization in this notebook. \n",
"\n",
"The standardized data will be stored in the */content/out* directory and zipped to */content/out.zip* for easy downloading.\n",
"\n",
"Author: EdgeImpulse, Inc.<br>\n",
"Date: July 28, 2022<br>\n",
"License: Apache-2.0<br>\n",
"\n",
"[1] http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf\n",
"\n",
"[2] https://becominghuman.ai/what-does-feature-scaling-mean-when-to-normalize-data-and-when-to-standardize-data-c3de654405ed "
],
"metadata": {
"id": "6BU8CqPaVWlP"
}
},
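{
"cell_type": "markdown",
"source": [
"*Aside:* a minimal sketch of the min-max normalization mentioned above, for cases where the data is not Gaussian. It is not used anywhere else in this notebook, and the array `x` below is a made-up single-channel example."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"### (Optional) Min-max normalization sketch -- not used in the rest of this notebook\n",
"\n",
"import numpy as np\n",
"\n",
"# Made-up readings for a single channel (placeholder values)\n",
"x = np.array([2.0, 5.0, 7.0, 10.0])\n",
"\n",
"# Subtract the minimum and divide by the range so the values fall between 0 and 1\n",
"x_norm = (x - x.min()) / np.ptp(x)\n",
"print(x_norm)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},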
{
"cell_type": "markdown",
"source": [
"## Step 1: Read data from CSV files\n",
"\n",
"Read each CSV, verify that the data (and header) are valid, save the data in Numpy format, and save the associated filename in a list."
],
"metadata": {
"id": "cILorJYMV86Z"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cNsxWt2XTD7C"
},
"outputs": [],
"source": [
"import csv\n",
"import os\n",
"import shutil\n",
"import random\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"source": [
"### Settings\n",
"\n",
"# Path information\n",
"HOME_PATH = \"/content\" # Location of the working directory\n",
"DATASET_PATH = \"/content/dataset\" # Upload your .csv samples to this directory\n",
"OUT_PATH = \"/content/out\" # Where output files go (will be deleted and recreated)\n",
"TRAIN_DIR = \"training\" # Where to store training output files\n",
"TEST_DIR = \"testing\" # Where to store testing output files\n",
"OUT_ZIP = \"/content/out.zip\" # Where to store the zipped output files\n",
"\n",
"# Set aside 20% for test\n",
"TEST_RATIO = 0.2\n",
"\n",
"# Seed for pseudorandomness \n",
"SEED = 42"
],
"metadata": {
"id": "pVVOlAH6TMG1"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Read in .csv files to construct our data in a numpy array\n",
"\n",
"X_all = []\n",
"filenames = []\n",
"first_sample = True\n",
"channel_names = None\n",
"sample_shape = None\n",
"\n",
"# Loop through all files in our dataset\n",
"for filename in os.listdir(DATASET_PATH):\n",
"\n",
" # Check if the path is a file\n",
" filepath = os.path.join(DATASET_PATH, filename)\n",
" if not os.path.isfile(filepath):\n",
" continue\n",
"\n",
" # Read CSV file\n",
" data = np.genfromtxt(filepath, \n",
" dtype=float,\n",
" delimiter=',',\n",
" names=True)\n",
"\n",
" # Get length of the sample\n",
" num_readings = data.shape[0]\n",
"\n",
" # Extract sample rate (in milliseconds), header (without timestamp), and shape info (without \n",
" # timestamp) from the first sample we read\n",
" if first_sample:\n",
" channel_names = data.dtype.names\n",
" sample_shape = (num_readings, len(channel_names))\n",
" first_sample = False\n",
"\n",
" # Check to make sure the new sample conforms to the first sample\n",
" else:\n",
"\n",
" # Check header\n",
" if data.dtype.names != channel_names:\n",
" print(\"Header does not match. Skipping\", filename)\n",
" continue\n",
"\n",
" # Check shape\n",
" if (num_readings, len(channel_names)) != sample_shape:\n",
" print(\"Shape does not match. Skipping\", filename)\n",
" continue\n",
"\n",
" # Create sample (drop timestamp column)\n",
" sample = np.zeros(sample_shape)\n",
" for i in range(num_readings):\n",
" sample[i, :] = np.array(data[i].item())\n",
"\n",
" # Append to our dataset\n",
" X_all.append(sample)\n",
"\n",
" # Append the filename to our list of filenames\n",
" filenames.append(filename)\n",
"\n",
"# Convert the dataset into a numpy array\n",
"X_all = np.array(X_all)\n",
"\n",
"# Get number of samples and channels\n",
"num_samples = X_all.shape[0]\n",
"num_channels = len(channel_names)\n",
"\n",
"print(\"Header:\", channel_names)\n",
"print(\"Dataset shape:\", X_all.shape)\n",
"print(\"Number of samples:\", num_samples)\n",
"print(\"Number of files\", len(filenames))"
],
"metadata": {
"id": "ADgJUJOYTNwE"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Step 2: Split the data\n",
"\n",
"We should not include the test set in our analysis or scaling efforts, as that could introduce a bias."
],
"metadata": {
"id": "H2Dy4TG9Wuj2"
}
},
{
"cell_type": "code",
"source": [
"### Shuffle and split dataset\n",
"\n",
"# Use a seed in case we want to recreate the exact results\n",
"random.seed(SEED)\n",
"\n",
"# Shuffle our dataset\n",
"X_y = list(zip(X_all, filenames))\n",
"random.shuffle(X_y)\n",
"X_all, filenames = zip(*X_y)\n",
"\n",
"# Calculate number of validation and test samples to put aside (round down)\n",
"num_samples_test = int(TEST_RATIO * num_samples)\n",
"\n",
"# The first `num_samples_test` samples of the shuffled list becomes the test set\n",
"X_test = X_all[:num_samples_test]\n",
"filenames_test = filenames[:num_samples_test]\n",
"\n",
"# The remaining samples become the training set\n",
"X_train = X_all[num_samples_test:]\n",
"filenames_train = filenames[num_samples_test:]\n",
"\n",
"# Convert data to Numpy arrays\n",
"X_train = np.asarray(X_train)\n",
"X_test = np.asarray(X_test)\n",
"\n",
"# Print shapes of our sets\n",
"print(\"X_train shape:\", X_train.shape)\n",
"print(\"X_test shape:\", X_test.shape)"
],
"metadata": {
"id": "hg8M_tZUXLbp"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Step 3: Analyze the training data\n",
"\n",
"Look at the histograms to determine if scaling is required"
],
"metadata": {
"id": "4RBOlTX3WJN5"
}
},
{
"cell_type": "code",
"source": [
"### Reshape the data (drop timestamp column)\n",
"def flatten_data_for_analysis(X, num_channels):\n",
"\n",
" # Calculate number of rows in each channel (channel = different sensor reading)\n",
" num_rows = X.shape[0] * X_train.shape[1]\n",
"\n",
" # Combine all data in each channel\n",
" X_flatten = np.reshape(X, (num_rows, num_channels))\n",
"\n",
" # Drop the timestamp column--it will mess up our analysis\n",
" X_flatten = np.delete(X_flatten, 0, axis=1)\n",
"\n",
" return X_flatten"
],
"metadata": {
"id": "nBUKlHOgZfOP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Examine the histograms of all the data\n",
"\n",
"# Settings\n",
"num_bins = 80\n",
"\n",
"# Flatten the data along each channel\n",
"X_train_flatten = flatten_data_for_analysis(X_train, num_channels)\n",
"channel_names_no_timestamp = channel_names[1:]\n",
"\n",
"# Create subplots\n",
"num_hists = len(channel_names_no_timestamp)\n",
"fig, axs = plt.subplots(1, num_hists, figsize=(20,3))\n",
"\n",
"# Create histogram for each category of data\n",
"for i in range(num_hists):\n",
" _ = axs[i].hist(X_train_flatten[:, i], \n",
" bins=num_bins)\n",
" axs[i].title.set_text(channel_names_no_timestamp[i])"
],
"metadata": {
"id": "4iNVDN0nZa8s"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"This look like fairly well-behaved data with Gaussian distributions. However, look at the X-axis range (the min and max values that each channel contains in our data)."
],
"metadata": {
"id": "PZr0kps6dKZP"
}
},
{
"cell_type": "code",
"source": [
"### Try the histograms with the same scale\n",
"\n",
"# Get the minimum and maximum values (the range)\n",
"min_val = X_train_flatten.min()\n",
"max_val = X_train_flatten.max()\n",
"\n",
"# Create subplots\n",
"num_hists = len(channel_names_no_timestamp)\n",
"fig, axs = plt.subplots(1, num_hists, figsize=(20,3))\n",
"\n",
"# Create histogram for each category of data\n",
"for i in range(num_hists):\n",
" _ = axs[i].hist(X_train_flatten[:, i], \n",
" bins=num_bins, \n",
" range=(min_val, max_val))\n",
" axs[i].title.set_text(channel_names_no_timestamp[i])"
],
"metadata": {
"id": "evC25Bc3azWa"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Whoa! If we graph using the same range, it looks like there's a lot more variance in our gyroscope data. However, that's simply becuase the gyroscope uses different units (and therefore has a different range of values) than our accelerometer. To fix this, we should standardize our data."
],
"metadata": {
"id": "NOWr6cx9dZ74"
}
},
{
"cell_type": "markdown",
"source": [
"## Step 4: Standardize the data\n",
"\n",
"Perform standarization so that our data, per-channel, has a mean of 0 and a standard deviation of 1."
],
"metadata": {
"id": "ap60olHaeDBN"
}
},
{
"cell_type": "code",
"source": [
"### Function to calculate dataset metrics (mean, std dev, etc.) for each channel\n",
"def calc_metrics(X, ignore_first_col=False):\n",
"\n",
" # Flatten along the channels\n",
" num_rows = X.shape[0] * X.shape[1]\n",
" X_flatten = np.reshape(X, (num_rows, num_channels))\n",
"\n",
" # Calculate means, standard deviations, and ranges\n",
" means = np.mean(X_flatten, axis=0)\n",
" std_devs = np.std(X_flatten, axis=0)\n",
" mins = np.min(X_flatten, axis=0)\n",
" ranges = np.ptp(X_flatten, axis=0)\n",
"\n",
" # Drop the first column if requested\n",
" if ignore_first_col:\n",
" return (means[1:], std_devs[1:], mins[1:], ranges[1:])\n",
" else:\n",
" return (means, std_devs, mins, ranges)"
],
"metadata": {
"id": "1RNw3STObUaX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Function to perform standardization for a given set of data\n",
"def standardize_data(a, mean, std_dev):\n",
" standardized_a = (a - mean) / std_dev\n",
" return standardized_a"
],
"metadata": {
"id": "25a_1gfLexyL"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Compute the metrics of the training data\n",
"\n",
"# Compute metrics (drop timestamp column)\n",
"(means, std_devs, mins, ranges) = calc_metrics(X_train, ignore_first_col=True)\n",
"\n",
"# Print out the results (drop timestamp column)\n",
"print(channel_names[1:])\n",
"print(\"Means:\", [float(\"{:.4f}\".format(x)) for x in means])\n",
"print(\"Std devs:\", [float(\"{:.4f}\".format(x)) for x in std_devs])\n",
"print(\"Mins:\", [float(\"{:.4f}\".format(x)) for x in mins])\n",
"print(\"Ranges:\", [float(\"{:.4f}\".format(x)) for x in ranges])"
],
"metadata": {
"id": "xkZoBU4uf0U4"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Record these values!** We will need them to perform preprocessing on the test dataset and for preprocessing raw data we capture during live inference (deployment)."
],
"metadata": {
"id": "9MxaQoqogf4-"
}
},
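{
"cell_type": "markdown",
"source": [
"*Aside:* a minimal sketch of how the recorded means and standard deviations might be applied to raw data at inference time. The values in `RECORDED_MEANS`/`RECORDED_STD_DEVS` and the `raw_sample` array below are placeholders; substitute the numbers printed above and your own sensor readings."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"### (Optional) Sketch: applying the recorded metrics to a new raw sample\n",
"\n",
"import numpy as np\n",
"\n",
"# Placeholder values -- replace with the per-channel means/std devs printed above\n",
"RECORDED_MEANS = np.array([0.0, 0.0, 0.0])\n",
"RECORDED_STD_DEVS = np.array([1.0, 1.0, 1.0])\n",
"\n",
"# Placeholder raw sample: (num_readings, num_channels), timestamp column already removed\n",
"raw_sample = np.zeros((100, 3))\n",
"\n",
"# Standardize with the *training* statistics, exactly as done for the test set below\n",
"raw_sample_std = (raw_sample - RECORDED_MEANS) / RECORDED_STD_DEVS\n",
"print(raw_sample_std.shape)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},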
{
"cell_type": "code",
"source": [
"### Standardize each channel (do NOT standardize the timestamp channel!)\n",
"\n",
"# Initialize standardized data arrays\n",
"X_train_std = np.zeros(X_train.shape)\n",
"X_test_std = np.zeros(X_test.shape)\n",
"\n",
"# Go through each channel in the training data\n",
"for i in range(len(channel_names)):\n",
" \n",
" # Skip the timestamp channel!\n",
" if i == 0:\n",
" X_train_std[:,:,i] = X_train[:,:,i]\n",
"\n",
" # Otherwise, perform standardization\n",
" else:\n",
" X_train_std[:,:,i] = standardize_data(X_train[:,:,i], \n",
" means[i - 1], \n",
" std_devs[i - 1])\n",
"\n",
"# Go through each channel in the test data. Notice that we use the same means\n",
"# and standard deviations that we calculated from the training data!\n",
"for i in range(len(channel_names)):\n",
" \n",
" # Skip the timestamp channel!\n",
" if i == 0:\n",
" X_test_std[:,:,i] = X_test[:,:,i]\n",
"\n",
" # Otherwise, perform standardization\n",
" else:\n",
" X_test_std[:,:,i] = standardize_data(X_test[:,:,i], \n",
" means[i - 1], \n",
" std_devs[i - 1])\n",
" \n",
"# Print shapes\n",
"print(\"X_train_std shape:\", X_train_std.shape)\n",
"print(\"X_test_std shape:\", X_test_std.shape)"
],
"metadata": {
"id": "hkzcbbbqgKad"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Examine the metrics and histograms of the newly standardized data\n",
"\n",
"# Compute metrics for the standardized data\n",
"(means, std_devs, mins, ranges) = calc_metrics(X_train_std, \n",
" ignore_first_col=True)\n",
"print(channel_names[1:])\n",
"print(\"Means:\", [float(\"{:.4f}\".format(x)) for x in means])\n",
"print(\"Std devs:\", [float(\"{:.4f}\".format(x)) for x in std_devs])\n",
"print(\"Mins:\", [float(\"{:.4f}\".format(x)) for x in mins])\n",
"print(\"Ranges:\", [float(\"{:.4f}\".format(x)) for x in ranges])\n",
"\n",
"# Flatten the data along each channel\n",
"X_train_flatten = flatten_data_for_analysis(X_train_std, num_channels)\n",
"channel_names_no_timestamp = channel_names[1:]\n",
"\n",
"# Create subplots\n",
"num_hists = len(channel_names_no_timestamp)\n",
"fig, axs = plt.subplots(1, num_hists, figsize=(20,3))\n",
"\n",
"# Create histogram for each category of data\n",
"for i in range(num_hists):\n",
" _ = axs[i].hist(X_train_flatten[:, i], \n",
" bins=num_bins)\n",
" axs[i].title.set_text(channel_names_no_timestamp[i])"
],
"metadata": {
"id": "4YVPATbKilPk"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Check the metrics and histograms above. All means should be 0.0 and all standard deviations should be 1.0. Do the ranges seem more reasonable?"
],
"metadata": {
"id": "NOFRnfuCv7zI"
}
},
{
"cell_type": "markdown",
"source": [
"## Steap 5: Store preprocessed data in CSV files"
],
"metadata": {
"id": "_EXrpdKUwKDp"
}
},
{
"cell_type": "code",
"source": [
"### Function to write header and data to CSV files to given directory\n",
"def write_csv_data(header, data, filenames, dir_path):\n",
"\n",
" # Go through each filename - should be in the same order as our samples in X\n",
" for i, filename in enumerate(filenames):\n",
"\n",
" # Write header and data (for that one sample) to the CSV file\n",
" file_path = os.path.join(dir_path, filename)\n",
" with open(file_path, 'w') as f:\n",
" csv_writer = csv.writer(f, delimiter=',')\n",
" csv_writer.writerow(header)\n",
" csv_writer.writerows(data[i])"
],
"metadata": {
"id": "dFG2Mo6DzVcD"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Delete output directory (if it exists) and recreate it\n",
"if os.path.exists(OUT_PATH):\n",
" shutil.rmtree(OUT_PATH)\n",
"os.makedirs(os.path.join(OUT_PATH, TRAIN_DIR))\n",
"os.makedirs(os.path.join(OUT_PATH, TEST_DIR))"
],
"metadata": {
"id": "ESCFHqP4j0H8"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Write training and test data to .csv files in separate directories\n",
"\n",
"# Write out training data\n",
"dir_path = os.path.join(OUT_PATH, TRAIN_DIR)\n",
"write_csv_data(channel_names, X_train_std, filenames_train, dir_path)\n",
"\n",
"# Write out test data\n",
"dir_path = os.path.join(OUT_PATH, TEST_DIR)\n",
"write_csv_data(channel_names, X_test_std, filenames_test, dir_path)"
],
"metadata": {
"id": "ZtMyzFV0wOPy"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Zip output directory\n",
"%cd {OUT_PATH}\n",
"!zip -FS -r -q {OUT_ZIP} *\n",
"%cd {HOME_PATH}"
],
"metadata": {
"id": "oWnbpkCk1Jdj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
""
],
"metadata": {
"id": "OcxLdV7x1lPN"
},
"execution_count": null,
"outputs": []
}
]
}