@mserranom
Last active January 26, 2021 19:13

Feature Store

Uber Michelangelo

https://eng.uber.com/michelangelo/

Finding good features is often the hardest part of machine learning and we have found that building and managing data pipelines is typically one of the most costly pieces of a complete machine learning solution.

A platform should provide standard tools for building data pipelines to generate feature and label data sets for training (and re-training) and feature-only data sets for predicting. These tools should have deep integration with the company’s data lake or warehouses and with the company’s online data serving systems. The pipelines need to be scalable and performant, incorporate integrated monitoring for data flow and data quality, and support both online and offline training and predicting. Ideally, they should also generate the features in a way that is shareable across teams to reduce duplicate work and increase data quality. They should also provide strong guard rails and controls to encourage and empower users to adopt best practices (e.g., making it easy to guarantee that the same data generation/preparation process is used at both training time and prediction time).

(...) (a) feature store that allows teams to share, discover, and use a highly curated set of features for their machine learning problems.

Goals

  1. Consume data from sources and transform them into Features for ML experiments.
  2. Produce a Feature Set by consuming and mixing Features from different sources (including public, curated data sources), with the help of Feature Discovery tooling.
  3. Feature Intelligence:
    • Automated Feature extraction (featuretools.com)
    • Feature Visualization
    • Pre-trained model integration to generate new features
    • Any type of Feature Selection aid
  4. Feature set management and versioning
  5. Training Data generation and management
  6. Prediction Endpoint generation
  7. Provide an SDK for rapid model training and deployment using existing solutions (Sagemaker, Comet ML, Valohai and others)

Data Consumption

Data Sources:

  • Blurr's DTCs
  • Database Connections
  • Results from DTC Experiments

Feature Set Management

Once the Features comprising the feature set are chosen, the feature set can be saved under a name.

Every time the feature set is updated and the changes are saved, a new version is produced. This is important because training data and prediction endpoints built over the Feature Set are likely to be incompatible across versions.

Other options:

  • We can create a version of the Feature Set only when Training Data is created or a Prediction Endpoint is deployed. In the meantime, we'll just save the latest state.
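
The save-produces-a-version behaviour described above can be sketched in a few lines. This is only an illustration under assumptions: the `FeatureSet` and `FeatureSetVersion` classes, their fields, and the `"latest"` lookup are hypothetical names, not an existing API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass(frozen=True)
class FeatureSetVersion:
    """An immutable snapshot of the features in a feature set (hypothetical)."""
    number: int
    features: Tuple[str, ...]


@dataclass
class FeatureSet:
    """Hypothetical feature set that records a new version on every save."""
    name: str
    versions: List[FeatureSetVersion] = field(default_factory=list)

    def save(self, features: List[str]) -> FeatureSetVersion:
        # Version numbers start at 1 and grow by one per saved change.
        version = FeatureSetVersion(len(self.versions) + 1, tuple(features))
        self.versions.append(version)
        return version

    def get(self, version: Union[int, str] = "latest") -> FeatureSetVersion:
        if not self.versions:
            raise LookupError(f"feature set {self.name!r} has no saved versions")
        if version == "latest":
            return self.versions[-1]
        if not isinstance(version, int) or not 1 <= version <= len(self.versions):
            raise LookupError(f"unknown version {version!r} of {self.name!r}")
        return self.versions[version - 1]


churn = FeatureSet("churn")
churn.save(["age", "country"])
churn.save(["age", "country", "sessions_last_7d"])
print(churn.get("latest").number)  # 2
print(churn.get(1).features)       # ('age', 'country')
```

Because each snapshot is frozen, training data or an endpoint that was built against version 1 keeps referring to exactly the features it was built with, even after later saves.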

Training Data Generation and Management

At any time, Training Data can be generated over a Feature Set. Once done, an S3 link to the data is provided.

The generation can be parameterised:

  • Generate training data only within a range of timestamps
  • Generate partitioned data

Training Data can be generated multiple times for a version of a Feature Set. Generation can also be scheduled, for cases where a new model needs to be trained at regular intervals.
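
As a rough sketch of the two generation parameters listed above (a timestamp range and partitioning), the following uses plain dictionaries as rows. The function name `generate_training_data`, its arguments, and the row layout are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Optional


def generate_training_data(
    rows: Iterable[dict],
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    partition_by: Optional[str] = None,
) -> Dict[str, List[dict]]:
    """Keep rows with start <= timestamp < end and group them into partitions.

    Hypothetical helper: every row is a dict with a "timestamp" key.
    Without partition_by, all rows land in a single partition named "all".
    """
    partitions: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        ts = row["timestamp"]
        if start is not None and ts < start:
            continue
        if end is not None and ts >= end:
            continue
        key = str(row[partition_by]) if partition_by else "all"
        partitions[key].append(row)
    return dict(partitions)


rows = [
    {"timestamp": datetime(2018, 1, 5), "country": "ES", "sessions": 3},
    {"timestamp": datetime(2018, 2, 1), "country": "UK", "sessions": 7},
    {"timestamp": datetime(2018, 3, 9), "country": "ES", "sessions": 1},
]
from_february = generate_training_data(
    rows, start=datetime(2018, 2, 1), partition_by="country"
)
print(sorted(from_february))  # ['ES', 'UK']
```

A scheduled generation would call the same function with a moving `start`/`end` window each time it runs.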

Training Data can potentially be compatible with multiple versions of a Feature Set, depending on the nature of the changes. For example, adding normalisation to the Feature Set does not affect existing denormalised Training Data.
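
One possible way to express that compatibility rule is sketched below. It assumes, purely for illustration, that a version is described by its feature names plus optional per-feature normalisation settings, and that denormalised data only depends on the feature names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FeatureSetVersion:
    """Hypothetical description of one version of a feature set."""
    features: Tuple[str, ...]
    # (feature name, normalisation method) pairs; empty means none applied.
    normalisation: Tuple[Tuple[str, str], ...] = ()


def denormalised_data_is_compatible(
    generated_for: FeatureSetVersion, candidate: FeatureSetVersion
) -> bool:
    # Denormalised data holds raw feature values, so only the feature
    # list matters; normalisation settings are deliberately ignored.
    return generated_for.features == candidate.features


v1 = FeatureSetVersion(("age", "country"))
v2 = FeatureSetVersion(("age", "country"), (("age", "min_max"),))
v3 = FeatureSetVersion(("age",))

print(denormalised_data_is_compatible(v1, v2))  # True
print(denormalised_data_is_compatible(v1, v3))  # False
```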

Prediction Endpoint Generation

A Prediction Endpoint can be generated for a specific version of a Feature Set. A URL with the endpoint is provided.

SDK

An opt-in SDK can be used to facilitate work with tools like Sagemaker.

Sagemaker:

data_a = pandas.read_something('data_source_a')
data_a = do_all_transforms(data_a)

data_b = pandas.read_something('data_source_b')
data_b = do_all_transforms(data_b)

training_data = do_all_joining(data_a, data_b)
x_train, y_train, x_test, y_test = split_training_test(training_data)

model = sagemaker.fit_model(x_train, y_train)

url = sagemaker.deploy(model)

http.request(url, '{"feature1": value, "feature2": value, "feature3": value}')

Sagemaker + SDK:

feature_set = store.get_feature_set('set_id', 'latest')

training_data = feature_set.get_training_data()

x_train, y_train, x_test, y_test = split_training_test(training_data)

model = store.sagemaker.fit_model(x_train, y_train)

url = store.gen_live_prediction(sagemaker.deploy(model))

http.request(url, '{"user_id": id}')
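
With the SDK, the request carries only a `user_id` instead of the feature values. One reading of this, which the text does not spell out, is that the endpoint looks up the features for that user on the server side before calling the model. The sketch below follows that assumption; `make_prediction_handler`, the in-memory `online_features` mapping, and the stand-in model are all hypothetical.

```python
import json
from typing import Callable, Dict, List


def make_prediction_handler(
    feature_names: List[str],
    online_features: Dict[str, Dict[str, float]],
    model: Callable[[List[float]], float],
) -> Callable[[str], str]:
    """Build a handler that turns a {"user_id": ...} request into a prediction.

    Hypothetical: feature_names is fixed by one feature set version, and
    online_features maps a user id to that user's current feature values.
    """
    def handle(request_body: str) -> str:
        user_id = json.loads(request_body)["user_id"]
        values = online_features[user_id]
        # Order the vector exactly as the feature set version defines it.
        vector = [values[name] for name in feature_names]
        return json.dumps({"user_id": user_id, "prediction": model(vector)})

    return handle


handler = make_prediction_handler(
    feature_names=["age", "sessions_last_7d"],
    online_features={"u1": {"age": 31.0, "sessions_last_7d": 4.0}},
    model=lambda vector: sum(vector),  # stand-in for a deployed model
)
print(handler('{"user_id": "u1"}'))
```

Pinning `feature_names` when the handler is built mirrors the idea that an endpoint belongs to one specific version of a Feature Set.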

Other Features

Normalisation

Normalisation is intrinsic to a Feature Set. For each feature, a normalisation method can be applied. In practice, this has the following consequences:

  • When generating training data, the caller must indicate whether the data should be normalised or denormalised.
  • When generating a Prediction Endpoint, a parameter can be used to retrieve either normalised or denormalised data.
  • More than one normalisation can be applied to Features.
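
A minimal sketch of per-feature normalisation with a normalised/denormalised switch follows. The two methods shown (min-max scaling and z-score) are common choices picked for illustration, not methods named by the text, and `get_column` with its arguments is a hypothetical helper. For simplicity the sketch maps one method to each feature.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List


def min_max(values: List[float]) -> List[float]:
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant column has no spread; map it to zeros.
    return [0.0 if span == 0 else (v - lo) / span for v in values]


def z_score(values: List[float]) -> List[float]:
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


NORMALISERS: Dict[str, Callable[[List[float]], List[float]]] = {
    "min_max": min_max,
    "z_score": z_score,
}


def get_column(
    columns: Dict[str, List[float]],
    methods: Dict[str, str],
    feature: str,
    normalised: bool,
) -> List[float]:
    """Return one feature column, normalised only when requested."""
    values = columns[feature]
    if not normalised or feature not in methods:
        return list(values)
    return NORMALISERS[methods[feature]](values)


columns = {"age": [20.0, 30.0, 40.0]}
methods = {"age": "min_max"}
print(get_column(columns, methods, "age", normalised=True))   # [0.0, 0.5, 1.0]
print(get_column(columns, methods, "age", normalised=False))  # [20.0, 30.0, 40.0]
```

Storing the raw values and applying the method on read is one way to serve both forms from the same Feature Set.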

Local infra

A hosted version of the Feature Store will be self-contained: for example, input data in one S3 bucket can land as training data in another S3 bucket, using only what the hosted infra provides. We will charge based on AWS cost.

However, clients may want to use their own infra:

  • S3 for privacy purposes
  • A connection to an existing Spark cluster to save money

Random

  • Consume images and use content from image recognition as features
  • Consume text and use sentiment analysis to extract features

Competitors

  • https://algorithmia.com/enterprise
  • https://bigml.com/features
  • https://www.dataiku.com/dss/features/machine-learning/
  • https://hydrosphere.io
  • https://www.datarobot.com/
  • https://www.h2o.ai/h2o/

Related

http://blog.richardweiss.org/2016/10/13/kaggle-with-luigi.html
