CSV Time Series Dataset Curation and Standardization
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "time-series-dataset-curation.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Dataset curation for time series\n",
"\n",
"[![Open In Colab <](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ShawnHymel/377f33c96876c4c7201083ed4da3c76a/raw/4678b2b5b391d2e08b455394e6e1e4c669a797bf/time_series_dataset_curation.ipynb)\n",
"\n",
"In the paper \"Efficient BackProp\" [1], LeCun et al. shows that we can achieve a more accurate model (e.g. artificial neural network) in less time by standarizing (i.e. to a mean of 0 and unit variance) and decorrelating our input data.\n",
"\n",
"However, the process of standarization assumes that the data is normally distributed (i.e. Gaussian). If our data does not follow a Gaussian distribution, we should perform normalization [2], where we divide by the range to produce a set of values between 0 and 1.\n",
"\n",
"Create a directory */content/dataset* and upload your entire dataset there. Run through the cells in this notebook, following all of the directions to analyze the data and create a curated dataset. Note that we perform only standardization in this notebook. \n",
"\n",
"The standardized data will be stored in the */content/out* directory and zipped to */content/out.zip* for easy downloading.\n",
"\n",
"Author: EdgeImpulse, Inc.<br>\n",
"Date: July 28, 2022<br>\n",
"License: Apache-2.0<br>\n",
"\n",
"[1] http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf\n",
"\n",
"[2] https://becominghuman.ai/what-does-feature-scaling-mean-when-to-normalize-data-and-when-to-standardize-data-c3de654405ed "
],
"metadata": {
"id": "6BU8CqPaVWlP"
}
},
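{
"cell_type": "markdown",
"source": [
"*Aside:* a minimal sketch of the min-max normalization mentioned above, for cases where the data is not Gaussian. It is not used anywhere else in this notebook, and the array `x` below is a made-up single-channel example."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"### (Optional) Min-max normalization sketch -- not used in the rest of this notebook\n",
"\n",
"import numpy as np\n",
"\n",
"# Made-up readings for a single channel (placeholder values)\n",
"x = np.array([2.0, 5.0, 7.0, 10.0])\n",
"\n",
"# Subtract the minimum and divide by the range so the values fall between 0 and 1\n",
"x_norm = (x - x.min()) / np.ptp(x)\n",
"print(x_norm)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},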
{
"cell_type": "markdown",
"source": [
"## Step 1: Read data from CSV files\n",
"\n",
"Read each CSV, verify that the data (and header) are valid, save the data in Numpy format, and save the associated filename in a list."
],
"metadata": {
"id": "cILorJYMV86Z"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cNsxWt2XTD7C"
},
"outputs": [],
"source": [
"import csv\n",
"import os\n",
"import shutil\n",
"import random\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"source": [
"### Settings\n",
"\n",
"# Path information\n",
"HOME_PATH = \"/content\" # Location of the working directory\n",
"DATASET_PATH = \"/content/dataset\" # Upload your .csv samples to this directory\n",
"OUT_PATH = \"/content/out\" # Where output files go (will be deleted and recreated)\n",
"TRAIN_DIR = \"training\" # Where to store training output files\n",
"TEST_DIR = \"testing\" # Where to store testing output files\n",
"OUT_ZIP = \"/content/out.zip\" # Where to store the zipped output files\n",
"\n",
"# Set aside 20% for test\n",
"TEST_RATIO = 0.2\n",
"\n",
"# Seed for pseudorandomness \n",
"SEED = 42"
],
"metadata": {
"id": "pVVOlAH6TMG1"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Read in .csv files to construct our data in a numpy array\n",
"\n",
"X_all = []\n",
"filenames = []\n",
"first_sample = True\n",
"channel_names = None\n",
"sample_shape = None\n",
"\n",
"# Loop through all files in our dataset\n",
"for filename in os.listdir(DATASET_PATH):\n",
"\n",
" # Check if the path is a file\n",
" filepath = os.path.join(DATASET_PATH, filename)\n",
" if not os.path.isfile(filepath):\n",
" continue\n",
"\n",
" # Read CSV file\n",
" data = np.genfromtxt(filepath, \n",
" dtype=float,\n",
" delimiter=',',\n",
" names=True)\n",
"\n",
" # Get length of the sample\n",
" num_readings = data.shape[0]\n",
"\n",
" # Extract sample rate (in milliseconds), header (without timestamp), and shape info (without \n",
" # timestamp) from the first sample we read\n",
" if first_sample:\n",
" channel_names = data.dtype.names\n",
" sample_shape = (num_readings, len(channel_names))\n",
" first_sample = False\n",
"\n",
" # Check to make sure the new sample conforms to the first sample\n",
" else:\n",
"\n",
" # Check header\n",
" if data.dtype.names != channel_names:\n",
" print(\"Header does not match. Skipping\", filename)\n",
" continue\n",
"\n",
" # Check shape\n",
" if (num_readings, len(channel_names)) != sample_shape:\n",
" print(\"Shape does not match. Skipping\", filename)\n",
" continue\n",
"\n",
" # Create sample (drop timestamp column)\n",
" sample = np.zeros(sample_shape)\n",
" for i in range(num_readings):\n",
" sample[i, :] = np.array(data[i].item())\n",
"\n",
" # Append to our dataset\n",
" X_all.append(sample)\n",
"\n",
" # Append the filename to our list of filenames\n",
" filenames.append(filename)\n",
"\n",
"# Convert the dataset into a numpy array\n",
"X_all = np.array(X_all)\n",
"\n",
"# Get number of samples and channels\n",
"num_samples = X_all.shape[0]\n",
"num_channels = len(channel_names)\n",
"\n",
"print(\"Header:\", channel_names)\n",
"print(\"Dataset shape:\", X_all.shape)\n",
"print(\"Number of samples:\", num_samples)\n",
"print(\"Number of files\", len(filenames))"
],
"metadata": {
"id": "ADgJUJOYTNwE"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Step 2: Split the data\n",
"\n",
"We should not include the test set in our analysis or scaling efforts, as that could introduce a bias."
],
"metadata": {
"id": "H2Dy4TG9Wuj2"
}
},
{
"cell_type": "code",
"source": [
"### Shuffle and split dataset\n",
"\n",
"# Use a seed in case we want to recreate the exact results\n",
"random.seed(SEED)\n",
"\n",
"# Shuffle our dataset\n",
"X_y = list(zip(X_all, filenames))\n",
"random.shuffle(X_y)\n",
"X_all, filenames = zip(*X_y)\n",
"\n",
"# Calculate number of validation and test samples to put aside (round down)\n",
"num_samples_test = int(TEST_RATIO * num_samples)\n",
"\n",
"# The first `num_samples_test` samples of the shuffled list becomes the test set\n",
"X_test = X_all[:num_samples_test]\n",
"filenames_test = filenames[:num_samples_test]\n",
"\n",
"# The remaining samples become the training set\n",
"X_train = X_all[num_samples_test:]\n",
"filenames_train = filenames[num_samples_test:]\n",
"\n",
"# Convert data to Numpy arrays\n",
"X_train = np.asarray(X_train)\n",
"X_test = np.asarray(X_test)\n",
"\n",
"# Print shapes of our sets\n",
"print(\"X_train shape:\", X_train.shape)\n",
"print(\"X_test shape:\", X_test.shape)"
],
"metadata": {
"id": "hg8M_tZUXLbp"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Step 3: Analyze the training data\n",
"\n",
"Look at the histograms to determine if scaling is required"
],
"metadata": {
"id": "4RBOlTX3WJN5"
}
},
{
"cell_type": "code",
"source": [
"### Reshape the data (drop timestamp column)\n",
"def flatten_data_for_analysis(X, num_channels):\n",
"\n",
" # Calculate number of rows in each channel (channel = different sensor reading)\n",
" num_rows = X.shape[0] * X_train.shape[1]\n",
"\n",
" # Combine all data in each channel\n",
" X_flatten = np.reshape(X, (num_rows, num_channels))\n",
"\n",
" # Drop the timestamp column--it will mess up our analysis\n",
" X_flatten = np.delete(X_flatten, 0, axis=1)\n",
"\n",
" return X_flatten"
],
"metadata": {
"id": "nBUKlHOgZfOP"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Examine the histograms of all the data\n",
"\n",
"# Settings\n",
"num_bins = 80\n",
"\n",
"# Flatten the data along each channel\n",
"X_train_flatten = flatten_data_for_analysis(X_train, num_channels)\n",
"channel_names_no_timestamp = channel_names[1:]\n",
"\n",
"# Create subplots\n",
"num_hists = len(channel_names_no_timestamp)\n",
"fig, axs = plt.subplots(1, num_hists, figsize=(20,3))\n",
"\n",
"# Create histogram for each category of data\n",
"for i in range(num_hists):\n",
" _ = axs[i].hist(X_train_flatten[:, i], \n",
" bins=num_bins)\n",
" axs[i].title.set_text(channel_names_no_timestamp[i])"
],
"metadata": {
"id": "4iNVDN0nZa8s"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"This look like fairly well-behaved data with Gaussian distributions. However, look at the X-axis range (the min and max values that each channel contains in our data)."
],
"metadata": {
"id": "PZr0kps6dKZP"
}
},
{
"cell_type": "code",
"source": [
"### Try the histograms with the same scale\n",
"\n",
"# Get the minimum and maximum values (the range)\n",
"min_val = X_train_flatten.min()\n",
"max_val = X_train_flatten.max()\n",
"\n",
"# Create subplots\n",
"num_hists = len(channel_names_no_timestamp)\n",
"fig, axs = plt.subplots(1, num_hists, figsize=(20,3))\n",
"\n",
"# Create histogram for each category of data\n",
"for i in range(num_hists):\n",
" _ = axs[i].hist(X_train_flatten[:, i], \n",
" bins=num_bins, \n",
" range=(min_val, max_val))\n",
" axs[i].title.set_text(channel_names_no_timestamp[i])"
],
"metadata": {
"id": "evC25Bc3azWa"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Whoa! If we graph using the same range, it looks like there's a lot more variance in our gyroscope data. However, that's simply becuase the gyroscope uses different units (and therefore has a different range of values) than our accelerometer. To fix this, we should standardize our data."
],
"metadata": {
"id": "NOWr6cx9dZ74"
}
},
{
"cell_type": "markdown",
"source": [
"## Step 4: Standardize the data\n",
"\n",
"Perform standarization so that our data, per-channel, has a mean of 0 and a standard deviation of 1."
],
"metadata": {
"id": "ap60olHaeDBN"
}
},
{
"cell_type": "code",
"source": [
"### Function to calculate dataset metrics (mean, std dev, etc.) for each channel\n",
"def calc_metrics(X, ignore_first_col=False):\n",
"\n",
" # Flatten along the channels\n",
" num_rows = X.shape[0] * X.shape[1]\n",
" X_flatten = np.reshape(X, (num_rows, num_channels))\n",
"\n",
" # Calculate means, standard deviations, and ranges\n",
" means = np.mean(X_flatten, axis=0)\n",
" std_devs = np.std(X_flatten, axis=0)\n",
" mins = np.min(X_flatten, axis=0)\n",
" ranges = np.ptp(X_flatten, axis=0)\n",
"\n",
" # Drop the first column if requested\n",
" if ignore_first_col:\n",
" return (means[1:], std_devs[1:], mins[1:], ranges[1:])\n",
" else:\n",
" return (means, std_devs, mins, ranges)"
],
"metadata": {
"id": "1RNw3STObUaX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Function to perform standardization for a given set of data\n",
"def standardize_data(a, mean, std_dev):\n",
" standardized_a = (a - mean) / std_dev\n",
" return standardized_a"
],
"metadata": {
"id": "25a_1gfLexyL"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Compute the metrics of the training data\n",
"\n",
"# Compute metrics (drop timestamp column)\n",
"(means, std_devs, mins, ranges) = calc_metrics(X_train, ignore_first_col=True)\n",
"\n",
"# Print out the results (drop timestamp column)\n",
"print(channel_names[1:])\n",
"print(\"Means:\", [float(\"{:.4f}\".format(x)) for x in means])\n",
"print(\"Std devs:\", [float(\"{:.4f}\".format(x)) for x in std_devs])\n",
"print(\"Mins:\", [float(\"{:.4f}\".format(x)) for x in mins])\n",
"print(\"Ranges:\", [float(\"{:.4f}\".format(x)) for x in ranges])"
],
"metadata": {
"id": "xkZoBU4uf0U4"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Record these values!** We will need them to perform preprocessing on the test dataset and for preprocessing raw data we capture during live inference (deployment)."
],
"metadata": {
"id": "9MxaQoqogf4-"
}
},
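{
"cell_type": "markdown",
"source": [
"*Aside:* a minimal sketch of how the recorded means and standard deviations might be applied to raw data at inference time. The values in `RECORDED_MEANS`/`RECORDED_STD_DEVS` and the `raw_sample` array below are placeholders; substitute the numbers printed above and your own sensor readings."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"### (Optional) Sketch: applying the recorded metrics to a new raw sample\n",
"\n",
"import numpy as np\n",
"\n",
"# Placeholder values -- replace with the per-channel means/std devs printed above\n",
"RECORDED_MEANS = np.array([0.0, 0.0, 0.0])\n",
"RECORDED_STD_DEVS = np.array([1.0, 1.0, 1.0])\n",
"\n",
"# Placeholder raw sample: (num_readings, num_channels), timestamp column already removed\n",
"raw_sample = np.zeros((100, 3))\n",
"\n",
"# Standardize with the *training* statistics, exactly as done for the test set below\n",
"raw_sample_std = (raw_sample - RECORDED_MEANS) / RECORDED_STD_DEVS\n",
"print(raw_sample_std.shape)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},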
{
"cell_type": "code",
"source": [
"### Standardize each channel (do NOT standardize the timestamp channel!)\n",
"\n",
"# Initialize standardized data arrays\n",
"X_train_std = np.zeros(X_train.shape)\n",
"X_test_std = np.zeros(X_test.shape)\n",
"\n",
"# Go through each channel in the training data\n",
"for i in range(len(channel_names)):\n",
" \n",
" # Skip the timestamp channel!\n",
" if i == 0:\n",
" X_train_std[:,:,i] = X_train[:,:,i]\n",
"\n",
" # Otherwise, perform standardization\n",
" else:\n",
" X_train_std[:,:,i] = standardize_data(X_train[:,:,i], \n",
" means[i - 1], \n",
" std_devs[i - 1])\n",
"\n",
"# Go through each channel in the test data. Notice that we use the same means\n",
"# and standard deviations that we calculated from the training data!\n",
"for i in range(len(channel_names)):\n",
" \n",
" # Skip the timestamp channel!\n",
" if i == 0:\n",
" X_test_std[:,:,i] = X_test[:,:,i]\n",
"\n",
" # Otherwise, perform standardization\n",
" else:\n",
" X_test_std[:,:,i] = standardize_data(X_test[:,:,i], \n",
" means[i - 1], \n",
" std_devs[i - 1])\n",
" \n",
"# Print shapes\n",
"print(\"X_train_std shape:\", X_train_std.shape)\n",
"print(\"X_test_std shape:\", X_test_std.shape)"
],
"metadata": {
"id": "hkzcbbbqgKad"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Examine the metrics and histograms of the newly standardized data\n",
"\n",
"# Compute metrics for the standardized data\n",
"(means, std_devs, mins, ranges) = calc_metrics(X_train_std, \n",
" ignore_first_col=True)\n",
"print(channel_names[1:])\n",
"print(\"Means:\", [float(\"{:.4f}\".format(x)) for x in means])\n",
"print(\"Std devs:\", [float(\"{:.4f}\".format(x)) for x in std_devs])\n",
"print(\"Mins:\", [float(\"{:.4f}\".format(x)) for x in mins])\n",
"print(\"Ranges:\", [float(\"{:.4f}\".format(x)) for x in ranges])\n",
"\n",
"# Flatten the data along each channel\n",
"X_train_flatten = flatten_data_for_analysis(X_train_std, num_channels)\n",
"channel_names_no_timestamp = channel_names[1:]\n",
"\n",
"# Create subplots\n",
"num_hists = len(channel_names_no_timestamp)\n",
"fig, axs = plt.subplots(1, num_hists, figsize=(20,3))\n",
"\n",
"# Create histogram for each category of data\n",
"for i in range(num_hists):\n",
" _ = axs[i].hist(X_train_flatten[:, i], \n",
" bins=num_bins)\n",
" axs[i].title.set_text(channel_names_no_timestamp[i])"
],
"metadata": {
"id": "4YVPATbKilPk"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Check the metrics and histograms above. All means should be 0.0 and all standard deviations should be 1.0. Do the ranges seem more reasonable?"
],
"metadata": {
"id": "NOFRnfuCv7zI"
}
},
{
"cell_type": "markdown",
"source": [
"## Steap 5: Store preprocessed data in CSV files"
],
"metadata": {
"id": "_EXrpdKUwKDp"
}
},
{
"cell_type": "code",
"source": [
"### Function to write header and data to CSV files to given directory\n",
"def write_csv_data(header, data, filenames, dir_path):\n",
"\n",
" # Go through each filename - should be in the same order as our samples in X\n",
" for i, filename in enumerate(filenames):\n",
"\n",
" # Write header and data (for that one sample) to the CSV file\n",
" file_path = os.path.join(dir_path, filename)\n",
" with open(file_path, 'w') as f:\n",
" csv_writer = csv.writer(f, delimiter=',')\n",
" csv_writer.writerow(header)\n",
" csv_writer.writerows(data[i])"
],
"metadata": {
"id": "dFG2Mo6DzVcD"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Delete output directory (if it exists) and recreate it\n",
"if os.path.exists(OUT_PATH):\n",
" shutil.rmtree(OUT_PATH)\n",
"os.makedirs(os.path.join(OUT_PATH, TRAIN_DIR))\n",
"os.makedirs(os.path.join(OUT_PATH, TEST_DIR))"
],
"metadata": {
"id": "ESCFHqP4j0H8"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Write training and test data to .csv files in separate directories\n",
"\n",
"# Write out training data\n",
"dir_path = os.path.join(OUT_PATH, TRAIN_DIR)\n",
"write_csv_data(channel_names, X_train_std, filenames_train, dir_path)\n",
"\n",
"# Write out test data\n",
"dir_path = os.path.join(OUT_PATH, TEST_DIR)\n",
"write_csv_data(channel_names, X_test_std, filenames_test, dir_path)"
],
"metadata": {
"id": "ZtMyzFV0wOPy"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Zip output directory\n",
"%cd {OUT_PATH}\n",
"!zip -FS -r -q {OUT_ZIP} *\n",
"%cd {HOME_PATH}"
],
"metadata": {
"id": "oWnbpkCk1Jdj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
""
],
"metadata": {
"id": "OcxLdV7x1lPN"
},
"execution_count": null,
"outputs": []
}
]
}