CSV Time Series Dataset Curation and Standardization
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "time-series-dataset-curation.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Dataset curation for time series\n",
"\n",
"[![Open In Colab <](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ShawnHymel/377f33c96876c4c7201083ed4da3c76a/raw/4678b2b5b391d2e08b455394e6e1e4c669a797bf/time_series_dataset_curation.ipynb)\n", | |
"\n", | |
"In the paper \"Efficient BackProp\" [1], LeCun et al. shows that we can achieve a more accurate model (e.g. artificial neural network) in less time by standarizing (i.e. to a mean of 0 and unit variance) and decorrelating our input data.\n", | |
"\n", | |
"However, the process of standarization assumes that the data is normally distributed (i.e. Gaussian). If our data does not follow a Gaussian distribution, we should perform normalization [2], where we divide by the range to produce a set of values between 0 and 1.\n", | |
"\n", | |
"Create a directory */content/dataset* and upload your entire dataset there. Run through the cells in this notebook, following all of the directions to analyze the data and create a curated dataset. Note that we perform only standardization in this notebook. \n", | |
"\n", | |
"The standardized data will be stored in the */content/out* directory and zipped to */content/out.zip* for easy downloading.\n", | |
"\n", | |
"Author: EdgeImpulse, Inc.<br>\n", | |
"Date: July 28, 2022<br>\n", | |
"License: Apache-2.0<br>\n", | |
"\n", | |
"[1] http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf\n", | |
"\n", | |
"[2] https://becominghuman.ai/what-does-feature-scaling-mean-when-to-normalize-data-and-when-to-standardize-data-c3de654405ed " | |
], | |
"metadata": { | |
"id": "6BU8CqPaVWlP" | |
} | |
}, | |
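{
"cell_type": "markdown",
"source": [
"*Added note (not in the original notebook):* the cell below is a minimal sketch contrasting the two scaling schemes on toy data, assuming nothing beyond NumPy. Standardization subtracts the mean and divides by the standard deviation; normalization subtracts the minimum and divides by the range to map values into [0, 1]."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"### Added sketch: standardization vs. normalization on toy data\n",
"import numpy as np\n",
"\n",
"# Draw toy Gaussian data with a nonzero mean and non-unit variance\n",
"rng = np.random.default_rng(42)\n",
"toy = rng.normal(loc=5.0, scale=2.0, size=1000)\n",
"\n",
"# Standardization: subtract the mean, divide by the standard deviation\n",
"toy_standardized = (toy - toy.mean()) / toy.std()\n",
"\n",
"# Normalization: subtract the minimum, divide by the range\n",
"toy_normalized = (toy - toy.min()) / np.ptp(toy)\n",
"\n",
"print(\"Standardized mean, std dev:\", toy_standardized.mean(), toy_standardized.std())\n",
"print(\"Normalized min, max:\", toy_normalized.min(), toy_normalized.max())"
],
"metadata": {},
"execution_count": null,
"outputs": []
},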
{
"cell_type": "markdown",
"source": [
"## Step 1: Read data from CSV files\n",
"\n",
"Read each CSV, verify that the data (and header) are valid, save the data in Numpy format, and save the associated filename in a list." | |
],
"metadata": {
"id": "cILorJYMV86Z"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cNsxWt2XTD7C"
},
"outputs": [],
"source": [
"import csv\n",
"import os\n",
"import shutil\n",
"import random\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"source": [
"### Settings\n",
"\n",
"# Path information\n",
"HOME_PATH = \"/content\"  # Location of the working directory\n",
"DATASET_PATH = \"/content/dataset\"  # Upload your .csv samples to this directory\n",
"OUT_PATH = \"/content/out\"  # Where output files go (will be deleted and recreated)\n",
"TRAIN_DIR = \"training\"  # Where to store training output files\n",
"TEST_DIR = \"testing\"  # Where to store testing output files\n",
"OUT_ZIP = \"/content/out.zip\"  # Where to store the zipped output files\n",
"\n",
"# Set aside 20% for test\n",
"TEST_RATIO = 0.2\n",
"\n",
"# Seed for pseudorandomness\n",
"SEED = 42"
],
"metadata": {
"id": "pVVOlAH6TMG1"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Read in .csv files to construct our data in a numpy array\n",
"\n",
"X_all = []\n",
"filenames = []\n",
"first_sample = True\n",
"channel_names = None\n",
"sample_shape = None\n",
"\n",
"# Loop through all files in our dataset\n",
"for filename in os.listdir(DATASET_PATH):\n",
"\n",
"    # Check if the path is a file\n",
"    filepath = os.path.join(DATASET_PATH, filename)\n",
"    if not os.path.isfile(filepath):\n",
"        continue\n",
"\n",
"    # Read CSV file\n",
"    data = np.genfromtxt(filepath,\n",
"                         dtype=float,\n",
"                         delimiter=',',\n",
"                         names=True)\n",
"\n",
"    # Get length of the sample\n",
"    num_readings = data.shape[0]\n",
"\n",
" # Extract sample rate (in milliseconds), header (without timestamp), and shape info (without \n", | |
" # timestamp) from the first sample we read\n", | |
" if first_sample:\n", | |
" channel_names = data.dtype.names\n", | |
" sample_shape = (num_readings, len(channel_names))\n", | |
" first_sample = False\n", | |
"\n", | |
" # Check to make sure the new sample conforms to the first sample\n", | |
" else:\n", | |
"\n", | |
" # Check header\n", | |
" if data.dtype.names != channel_names:\n", | |
" print(\"Header does not match. Skipping\", filename)\n", | |
" continue\n", | |
"\n", | |
" # Check shape\n", | |
" if (num_readings, len(channel_names)) != sample_shape:\n", | |
" print(\"Shape does not match. Skipping\", filename)\n", | |
" continue\n", | |
"\n", | |
" # Create sample (drop timestamp column)\n", | |
" sample = np.zeros(sample_shape)\n", | |
" for i in range(num_readings):\n", | |
" sample[i, :] = np.array(data[i].item())\n", | |
"\n", | |
" # Append to our dataset\n", | |
" X_all.append(sample)\n", | |
"\n", | |
" # Append the filename to our list of filenames\n", | |
" filenames.append(filename)\n", | |
"\n", | |
"# Convert the dataset into a numpy array\n", | |
"X_all = np.array(X_all)\n", | |
"\n", | |
"# Get number of samples and channels\n", | |
"num_samples = X_all.shape[0]\n", | |
"num_channels = len(channel_names)\n", | |
"\n", | |
"print(\"Header:\", channel_names)\n", | |
"print(\"Dataset shape:\", X_all.shape)\n", | |
"print(\"Number of samples:\", num_samples)\n", | |
"print(\"Number of files\", len(filenames))" | |
],
"metadata": {
"id": "ADgJUJOYTNwE"
},
"execution_count": null,
"outputs": []
},
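{
"cell_type": "markdown",
"source": [
"*Added note (not in the original notebook):* as a quick sanity check, the sketch below plots the channels of the first parsed sample against its timestamp column. It assumes at least one valid CSV was read and that column 0 holds the timestamps."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"### Added sketch: plot the first sample to sanity-check the parsed data\n",
"if num_samples > 0:\n",
"    first = X_all[0]\n",
"    for ch in range(1, num_channels):\n",
"        plt.plot(first[:, 0], first[:, ch], label=channel_names[ch])\n",
"    plt.xlabel(channel_names[0])\n",
"    plt.legend()\n",
"    plt.title(filenames[0])\n",
"    plt.show()"
],
"metadata": {},
"execution_count": null,
"outputs": []
},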
{
"cell_type": "markdown",
"source": [
"## Step 2: Split the data\n",
"\n",
"We should not include the test set in our analysis or scaling efforts, as that could introduce a bias."
],
"metadata": {
"id": "H2Dy4TG9Wuj2"
}
},
{
"cell_type": "code",
"source": [
"### Shuffle and split dataset\n",
"\n",
"# Use a seed in case we want to recreate the exact results\n",
"random.seed(SEED)\n",
"\n",
"# Shuffle our dataset\n",
"X_y = list(zip(X_all, filenames))\n",
"random.shuffle(X_y)\n",
"X_all, filenames = zip(*X_y)\n",
"\n",
"# Calculate number of validation and test samples to put aside (round down)\n", | |
"num_samples_test = int(TEST_RATIO * num_samples)\n", | |
"\n", | |
"# The first `num_samples_test` samples of the shuffled list becomes the test set\n", | |
"X_test = X_all[:num_samples_test]\n", | |
"filenames_test = filenames[:num_samples_test]\n", | |
"\n", | |
"# The remaining samples become the training set\n", | |
"X_train = X_all[num_samples_test:]\n", | |
"filenames_train = filenames[num_samples_test:]\n", | |
"\n", | |
"# Convert data to Numpy arrays\n", | |
"X_train = np.asarray(X_train)\n", | |
"X_test = np.asarray(X_test)\n", | |
"\n", | |
"# Print shapes of our sets\n", | |
"print(\"X_train shape:\", X_train.shape)\n", | |
"print(\"X_test shape:\", X_test.shape)" | |
], | |
"metadata": { | |
"id": "hg8M_tZUXLbp" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
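{
"cell_type": "markdown",
"source": [
"*Added note (not in the original notebook):* for reference, the shuffle-and-split above can also be done with scikit-learn's `train_test_split`, which accepts parallel arrays so the filenames stay matched to their samples. scikit-learn comes preinstalled in Colab. The sketch below writes to separate variables so it does not disturb the split above."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"### Added sketch: equivalent split using scikit-learn\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Split the samples and the parallel filename list in one call\n",
"X_train_alt, X_test_alt, fnames_train_alt, fnames_test_alt = train_test_split(\n",
"    np.asarray(X_all),\n",
"    list(filenames),\n",
"    test_size=TEST_RATIO,\n",
"    random_state=SEED)\n",
"\n",
"print(\"Alternative split shapes:\", X_train_alt.shape, X_test_alt.shape)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},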
{
"cell_type": "markdown",
"source": [
"## Step 3: Analyze the training data\n",
"\n",
"Look at the histograms to determine if scaling is required."
],
"metadata": {
"id": "4RBOlTX3WJN5"
}
},
{
"cell_type": "code",
"source": [
"### Reshape the data (drop timestamp column)\n",
"def flatten_data_for_analysis(X, num_channels):\n",
"\n",
"    # Calculate number of rows in each channel (channel = different sensor reading)\n",
" num_rows = X.shape[0] * X_train.shape[1]\n", | |
"\n", | |
" # Combine all data in each channel\n", | |
" X_flatten = np.reshape(X, (num_rows, num_channels))\n", | |
"\n", | |
" # Drop the timestamp column--it will mess up our analysis\n", | |
" X_flatten = np.delete(X_flatten, 0, axis=1)\n", | |
"\n", | |
" return X_flatten" | |
], | |
"metadata": { | |
"id": "nBUKlHOgZfOP" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"### Examine the histograms of all the data\n", | |
"\n", | |
"# Settings\n", | |
"num_bins = 80\n", | |
"\n", | |
"# Flatten the data along each channel\n", | |
"X_train_flatten = flatten_data_for_analysis(X_train, num_channels)\n", | |
"channel_names_no_timestamp = channel_names[1:]\n", | |
"\n", | |
"# Create subplots\n", | |
"num_hists = len(channel_names_no_timestamp)\n", | |
"fig, axs = plt.subplots(1, num_hists, figsize=(20,3))\n", | |
"\n", | |
"# Create histogram for each category of data\n", | |
"for i in range(num_hists):\n", | |
" _ = axs[i].hist(X_train_flatten[:, i], \n", | |
" bins=num_bins)\n", | |
" axs[i].title.set_text(channel_names_no_timestamp[i])" | |
], | |
"metadata": { | |
"id": "4iNVDN0nZa8s" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"This look like fairly well-behaved data with Gaussian distributions. However, look at the X-axis range (the min and max values that each channel contains in our data)." | |
],
"metadata": {
"id": "PZr0kps6dKZP"
}
},
{
"cell_type": "code",
"source": [
"### Try the histograms with the same scale\n",
"\n",
"# Get the minimum and maximum values (the range)\n",
"min_val = X_train_flatten.min()\n",
"max_val = X_train_flatten.max()\n",
"\n",
"# Create subplots\n",
"num_hists = len(channel_names_no_timestamp)\n",
"fig, axs = plt.subplots(1, num_hists, figsize=(20,3))\n",
"\n",
"# Create histogram for each category of data\n",
"for i in range(num_hists):\n",
"    _ = axs[i].hist(X_train_flatten[:, i],\n",
"                    bins=num_bins,\n",
"                    range=(min_val, max_val))\n",
"    axs[i].title.set_text(channel_names_no_timestamp[i])"
],
"metadata": {
"id": "evC25Bc3azWa"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Whoa! If we graph using the same range, it looks like there's a lot more variance in our gyroscope data. However, that's simply becuase the gyroscope uses different units (and therefore has a different range of values) than our accelerometer. To fix this, we should standardize our data." | |
],
"metadata": {
"id": "NOWr6cx9dZ74"
}
},
{
"cell_type": "markdown",
"source": [
"## Step 4: Standardize the data\n",
"\n",
"Perform standarization so that our data, per-channel, has a mean of 0 and a standard deviation of 1." | |
],
"metadata": {
"id": "ap60olHaeDBN"
}
},
{
"cell_type": "code",
"source": [
"### Function to calculate dataset metrics (mean, std dev, etc.) for each channel\n",
"def calc_metrics(X, ignore_first_col=False):\n",
"\n",
"    # Flatten along the channels\n",
"    num_rows = X.shape[0] * X.shape[1]\n",
" X_flatten = np.reshape(X, (num_rows, num_channels))\n", | |
"\n", | |
" # Calculate means, standard deviations, and ranges\n", | |
" means = np.mean(X_flatten, axis=0)\n", | |
" std_devs = np.std(X_flatten, axis=0)\n", | |
" mins = np.min(X_flatten, axis=0)\n", | |
" ranges = np.ptp(X_flatten, axis=0)\n", | |
"\n", | |
" # Drop the first column if requested\n", | |
" if ignore_first_col:\n", | |
" return (means[1:], std_devs[1:], mins[1:], ranges[1:])\n", | |
" else:\n", | |
" return (means, std_devs, mins, ranges)" | |
], | |
"metadata": { | |
"id": "1RNw3STObUaX" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"### Function to perform standardization for a given set of data\n", | |
"def standardize_data(a, mean, std_dev):\n", | |
" standardized_a = (a - mean) / std_dev\n", | |
" return standardized_a" | |
], | |
"metadata": { | |
"id": "25a_1gfLexyL" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
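{
"cell_type": "markdown",
"source": [
"*Added note (not in the original notebook):* a quick sketch to check the helper: standardizing and then inverting the transform (multiply by the standard deviation, add back the mean) should recover the original values."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"### Added sketch: verify that standardization is invertible\n",
"demo = np.array([1.0, 2.0, 3.0, 4.0])\n",
"demo_std = standardize_data(demo, demo.mean(), demo.std())\n",
"\n",
"# Invert the transform and confirm we get the original values back\n",
"demo_back = demo_std * demo.std() + demo.mean()\n",
"assert np.allclose(demo, demo_back)\n",
"print(\"Standardized:\", demo_std)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},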
{
"cell_type": "code",
"source": [
"### Compute the metrics of the training data\n",
"\n",
"# Compute metrics (drop timestamp column)\n",
"(means, std_devs, mins, ranges) = calc_metrics(X_train, ignore_first_col=True)\n",
"\n",
"# Print out the results (drop timestamp column)\n",
"print(channel_names[1:])\n",
"print(\"Means:\", [float(\"{:.4f}\".format(x)) for x in means])\n",
"print(\"Std devs:\", [float(\"{:.4f}\".format(x)) for x in std_devs])\n",
"print(\"Mins:\", [float(\"{:.4f}\".format(x)) for x in mins])\n",
"print(\"Ranges:\", [float(\"{:.4f}\".format(x)) for x in ranges])"
],
"metadata": {
"id": "xkZoBU4uf0U4"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Record these values!** We will need them to preprocess the test dataset and to preprocess raw data we capture during live inference (deployment)."
],
"metadata": {
"id": "9MxaQoqogf4-"
}
},
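{
"cell_type": "markdown",
"source": [
"*Added note (not in the original notebook):* instead of copying the numbers by hand, you can also persist them to a file and load that file later when preprocessing test or live data. The filename *standardization_params.json* below is arbitrary."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"### Added sketch: persist the training-set metrics for later preprocessing\n",
"import json\n",
"\n",
"# Means and std devs exclude the timestamp column, matching channel_names[1:]\n",
"params = {\n",
"    \"channel_names\": list(channel_names[1:]),\n",
"    \"means\": means.tolist(),\n",
"    \"std_devs\": std_devs.tolist()\n",
"}\n",
"with open(os.path.join(HOME_PATH, \"standardization_params.json\"), \"w\") as f:\n",
"    json.dump(params, f, indent=2)\n",
"\n",
"print(json.dumps(params, indent=2))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},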
{
"cell_type": "code",
"source": [
"### Standardize each channel (do NOT standardize the timestamp channel!)\n",
"\n",
"# Initialize standardized data arrays\n",
"X_train_std = np.zeros(X_train.shape)\n",
"X_test_std = np.zeros(X_test.shape)\n",
"\n",
"# Go through each channel in the training data\n",
"for i in range(len(channel_names)):\n",
"\n",
"    # Skip the timestamp channel!\n",
"    if i == 0:\n",
"        X_train_std[:,:,i] = X_train[:,:,i]\n",
"\n",
"    # Otherwise, perform standardization\n",
"    else:\n",
"        X_train_std[:,:,i] = standardize_data(X_train[:,:,i],\n",
"                                              means[i - 1],\n",
"                                              std_devs[i - 1])\n",
"\n",
"# Go through each channel in the test data. Notice that we use the same means\n",
"# and standard deviations that we calculated from the training data!\n",
"for i in range(len(channel_names)):\n",
"\n",
"    # Skip the timestamp channel!\n",
"    if i == 0:\n",
"        X_test_std[:,:,i] = X_test[:,:,i]\n",
"\n",
"    # Otherwise, perform standardization\n",
"    else:\n",
"        X_test_std[:,:,i] = standardize_data(X_test[:,:,i],\n",
"                                             means[i - 1],\n",
"                                             std_devs[i - 1])\n",
"\n",
"# Print shapes\n",
"print(\"X_train_std shape:\", X_train_std.shape)\n",
"print(\"X_test_std shape:\", X_test_std.shape)"
],
"metadata": {
"id": "hkzcbbbqgKad"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Examine the metrics and histograms of the newly standardized data\n",
"\n",
"# Compute metrics for the standardized data\n",
"(means, std_devs, mins, ranges) = calc_metrics(X_train_std,\n",
"                                               ignore_first_col=True)\n",
"print(channel_names[1:])\n",
"print(\"Means:\", [float(\"{:.4f}\".format(x)) for x in means])\n",
"print(\"Std devs:\", [float(\"{:.4f}\".format(x)) for x in std_devs])\n",
"print(\"Mins:\", [float(\"{:.4f}\".format(x)) for x in mins])\n",
"print(\"Ranges:\", [float(\"{:.4f}\".format(x)) for x in ranges])\n",
"\n",
"# Flatten the data along each channel\n",
"X_train_flatten = flatten_data_for_analysis(X_train_std, num_channels)\n",
"channel_names_no_timestamp = channel_names[1:]\n",
"\n",
"# Create subplots\n",
"num_hists = len(channel_names_no_timestamp)\n",
"fig, axs = plt.subplots(1, num_hists, figsize=(20,3))\n",
"\n",
"# Create histogram for each category of data\n",
"for i in range(num_hists):\n",
"    _ = axs[i].hist(X_train_flatten[:, i],\n",
"                    bins=num_bins)\n",
"    axs[i].title.set_text(channel_names_no_timestamp[i])"
],
"metadata": {
"id": "4YVPATbKilPk"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Check the metrics and histograms above. All means should be 0.0 and all standard deviations should be 1.0. Do the ranges seem more reasonable?"
],
"metadata": {
"id": "NOFRnfuCv7zI"
}
},
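{
"cell_type": "markdown",
"source": [
"*Added note (not in the original notebook):* the same check can be done programmatically. After the previous cell, `means` and `std_devs` hold the recomputed per-channel metrics of the standardized training data, so they should be ~0 and ~1 up to floating-point error."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"### Added sketch: assert that standardization worked\n",
"assert np.allclose(means, 0.0, atol=1e-6), \"Means are not ~0\"\n",
"assert np.allclose(std_devs, 1.0, atol=1e-6), \"Std devs are not ~1\"\n",
"print(\"Standardization check passed\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},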
{
"cell_type": "markdown",
"source": [
"## Steap 5: Store preprocessed data in CSV files" | |
],
"metadata": {
"id": "_EXrpdKUwKDp"
}
},
{
"cell_type": "code",
"source": [
"### Function to write header and data to CSV files in a given directory\n",
"def write_csv_data(header, data, filenames, dir_path):\n",
"\n",
"    # Go through each filename - should be in the same order as our samples in data\n",
"    for i, filename in enumerate(filenames):\n",
"\n",
"        # Write header and data (for that one sample) to the CSV file\n",
"        file_path = os.path.join(dir_path, filename)\n",
"        with open(file_path, 'w') as f:\n",
"            csv_writer = csv.writer(f, delimiter=',')\n",
"            csv_writer.writerow(header)\n",
"            csv_writer.writerows(data[i])"
],
"metadata": {
"id": "dFG2Mo6DzVcD"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Delete output directory (if it exists) and recreate it\n",
"if os.path.exists(OUT_PATH):\n",
"    shutil.rmtree(OUT_PATH)\n",
"os.makedirs(os.path.join(OUT_PATH, TRAIN_DIR))\n",
"os.makedirs(os.path.join(OUT_PATH, TEST_DIR))"
],
"metadata": {
"id": "ESCFHqP4j0H8"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Write training and test data to .csv files in separate directories\n",
"\n",
"# Write out training data\n",
"dir_path = os.path.join(OUT_PATH, TRAIN_DIR)\n",
"write_csv_data(channel_names, X_train_std, filenames_train, dir_path)\n",
"\n",
"# Write out test data\n",
"dir_path = os.path.join(OUT_PATH, TEST_DIR)\n",
"write_csv_data(channel_names, X_test_std, filenames_test, dir_path)"
],
"metadata": {
"id": "ZtMyzFV0wOPy"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"### Zip output directory\n",
"%cd {OUT_PATH}\n",
"!zip -FS -r -q {OUT_ZIP} *\n",
"%cd {HOME_PATH}"
],
"metadata": {
"id": "oWnbpkCk1Jdj"
},
"execution_count": null,
"outputs": []
},
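{
"cell_type": "markdown",
"source": [
"*Added note (not in the original notebook):* to pull the archive down without using the Files sidebar, you can trigger a browser download from Colab. The `google.colab` module only exists inside Colab, hence the guard."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"### Added sketch: download the zip from within Colab\n",
"try:\n",
"    from google.colab import files\n",
"    files.download(OUT_ZIP)\n",
"except ImportError:\n",
"    print(\"Not running in Colab; download\", OUT_ZIP, \"manually\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
}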
]
} |