https://eng.uber.com/michelangelo/
Finding good features is often the hardest part of machine learning and we have found that building and managing data pipelines is typically one of the most costly pieces of a complete machine learning solution.
A platform should provide standard tools for building data pipelines to generate feature and label data sets for training (and re-training) and feature-only data sets for predicting. These tools should have deep integration with the company’s data lake or warehouses and with the company’s online data serving systems. The pipelines need to be scalable and performant, incorporate integrated monitoring for data flow and data quality, and support both online and offline training and predicting. Ideally, they should also generate the features in a way that is shareable across teams to reduce duplicate work and increase data quality. They should also provide strong guard rails and controls to encourage and empower users to adopt best practices (e.g., making it easy to guarantee that the same data generation/preparation process is used at both training time and prediction time).
(...) (a) feature store that allows teams to share, discover, and use a highly curated set of features for their machine learning problems.
- Consume data from sources and transform it into Features for ML experiments.
- Produce a Feature Set by consuming and mixing Features from different sources (including public, curated data sources), with the help of Feature Discovery tooling.
- Feature Intelligence:
- Automated Feature extraction (featuretools.com)
- Feature Visualization
- Pre-trained model integration to generate new features
- Any type of Feature Selection aid
- Feature set management and versioning
- Training Data generation and management
- Prediction Endpoint generation
- Availability of an SDK for rapid model training and deployment using existing solutions (Sagemaker, Comet ML, Valohai, and others)
Data Sources:
- Blurr's DTCs
- Database Connections
- Results from DTC Experiments
Once the Features comprising the feature set are chosen, the feature set can be saved using a name.
Every time the feature set is updated and changes are saved, a new version is produced. This is important, because training data and prediction endpoints over the Feature Set are likely incompatible between versions.
Other options:
- We can create a version of the Feature Set only when Training Data is created or a Prediction Endpoint is deployed. In the meantime, we'll just save the latest state.
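The default versioning rule (a new version whenever a changed Feature Set is saved) can be sketched as follows. This is a minimal in-memory illustration, not the store's implementation: `FeatureSetRegistry`, `FeatureSetVersion`, and their methods are hypothetical names, and the sketch assumes a Feature Set is identified by its name and an unordered list of feature names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeatureSetVersion:
    """Immutable snapshot of a Feature Set (hypothetical structure)."""
    name: str
    version: int
    features: tuple


@dataclass
class FeatureSetRegistry:
    """In-memory stand-in for the store's versioning backend (assumption)."""
    _versions: dict = field(default_factory=dict)

    def save(self, name, features):
        """Save a Feature Set; create a new version only if the features changed."""
        history = self._versions.setdefault(name, [])
        snapshot = tuple(sorted(features))  # feature order is treated as irrelevant
        if history and history[-1].features == snapshot:
            return history[-1]  # nothing changed, keep the current version
        new_version = FeatureSetVersion(name, len(history) + 1, snapshot)
        history.append(new_version)
        return new_version

    def get(self, name, version="latest"):
        """Fetch a specific version number, or the most recent one."""
        history = self._versions[name]
        return history[-1] if version == "latest" else history[version - 1]


registry = FeatureSetRegistry()
v1 = registry.save("churn", ["age", "country"])
v2 = registry.save("churn", ["age", "country", "last_login"])
unchanged = registry.save("churn", ["last_login", "age", "country"])
```

Because old versions are never overwritten, Training Data and Prediction Endpoints can keep pointing at the exact version they were built from.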
At any time, Training Data can be generated over a Feature Set. Once done, an S3 link to the data is provided.
The generation can be parameterised:
- Generate training data only within a range of timestamps
- Generate partitioned data
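The two parameters above could look like this in code. This is a sketch under assumptions: `generate_training_data`, its argument names, and the row schema (dicts with a `timestamp` field) are made up for illustration, and the real generation would run over the store's data rather than an in-memory list.

```python
from collections import defaultdict
from datetime import datetime


def generate_training_data(rows, start=None, end=None, partition_by=None):
    """Keep rows with start <= timestamp < end, optionally grouped into partitions.

    rows: iterable of dicts, each with a "timestamp" datetime (assumed schema).
    partition_by: optional function mapping a row to a partition key.
    """
    selected = [
        row for row in rows
        if (start is None or row["timestamp"] >= start)
        and (end is None or row["timestamp"] < end)
    ]
    if partition_by is None:
        return {"all": selected}
    partitions = defaultdict(list)
    for row in selected:
        partitions[partition_by(row)].append(row)
    return dict(partitions)


rows = [
    {"timestamp": datetime(2018, 1, 5), "user_id": 1, "label": 0},
    {"timestamp": datetime(2018, 1, 20), "user_id": 2, "label": 1},
    {"timestamp": datetime(2018, 2, 3), "user_id": 3, "label": 0},
]

# January only, one partition per day.
january = generate_training_data(
    rows,
    start=datetime(2018, 1, 1),
    end=datetime(2018, 2, 1),
    partition_by=lambda row: row["timestamp"].date().isoformat(),
)
```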
Training Data can be generated multiple times for a version of a Feature Set. Generation can also be scheduled when a new model needs to be trained at regular intervals.
Training Data can potentially be compatible with multiple versions of a Feature Set, depending on the nature of the changes. Example: adding normalisation to the feature set does not affect existing denormalised Training Data.
A Prediction Endpoint can be generated for a specific version of a Feature Set. A URL with the endpoint is provided.
An opt-in SDK can be used to facilitate work with tools like Sagemaker.
Sagemaker:
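# Pseudocode: read_something, do_all_transforms, do_all_joining, split_training_test,
# fit_model and deploy are placeholders, not real pandas or SageMaker calls.
data_a = pandas.read_something('data_source_a')
data_a = do_all_transforms(data_a)
data_b = pandas.read_something('data_source_b')
data_b = do_all_transforms(data_b)
training_data = do_all_joining(data_a, data_b)
x_train, y_train, x_test, y_test = split_training_test(training_data)
model = sagemaker.fit_model(x_train, y_train)
url = sagemaker.deploy(model)
# The caller has to send every feature value with each request.
http.request(url, '{"feature1": value, "feature2": value, "feature3": value}')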
Sagemaker + SDK:
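# Pseudocode: the store and feature_set calls sketch the proposed SDK.
feature_set = store.get_feature_set('set_id', 'latest')
training_data = feature_set.get_training_data()
x_train, y_train, x_test, y_test = split_training_test(training_data)
model = store.sagemaker.fit_model(x_train, y_train)
url = store.gen_live_prediction(sagemaker.deploy(model))
# The caller only sends an entity ID; the feature values are presumably looked up by the store.
http.request(url, '{"user_id": id}')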
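The two request payloads above differ in who supplies the feature values: without the SDK the caller sends them, with the SDK the caller sends only a `user_id`. One way to read that, sketched below, is that the live prediction endpoint wraps the deployed model with a feature lookup. The name `gen_live_prediction` is borrowed from the pseudocode, but its signature here, the dictionary standing in for the online store, and the `predict` callable standing in for a deployed model are all assumptions.

```python
def gen_live_prediction(feature_lookup, feature_names, predict):
    """Wrap a model so callers send an entity ID instead of feature values.

    feature_lookup: maps user_id -> {feature name: value} (stand-in for the online store).
    feature_names: the feature order the model was trained with.
    predict: callable taking a list of feature values (stand-in for a deployed model).
    """
    def endpoint(request):
        features = feature_lookup[request["user_id"]]
        vector = [features[name] for name in feature_names]
        return {"user_id": request["user_id"], "prediction": predict(vector)}

    return endpoint


online_features = {42: {"feature1": 0.5, "feature2": 2.0, "feature3": 1.0}}

# `sum` is a toy model; a real endpoint would call the deployed model instead.
endpoint = gen_live_prediction(
    online_features,
    ["feature1", "feature2", "feature3"],
    predict=sum,
)
response = endpoint({"user_id": 42})
```

Keeping the feature order inside the endpoint is what ties it to a specific Feature Set version: if the features change, the wrapped model no longer receives the vector it was trained on.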
Normalisation is intrinsic to a Feature Set. For each feature, a normalisation method can be applied. In practice, this results in two things:
- When generating Training Data, the caller must indicate whether the data should be normalised or denormalised.
- When generating a Prediction Endpoint, a parameter can be used to retrieve either normalised or denormalised data.
- More than one normalisation can be applied to Features.
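A per-feature normalisation with a normalised/denormalised switch might look like the sketch below. Min-max scaling is used only as an example method; `MinMaxNormaliser`, `export_rows`, and the `normalised` flag are hypothetical names, not part of any existing API.

```python
from dataclasses import dataclass


@dataclass
class MinMaxNormaliser:
    """Min-max scaling to [0, 1]; one possible per-feature method (assumption)."""
    minimum: float
    maximum: float

    def normalise(self, value):
        span = self.maximum - self.minimum
        return 0.0 if span == 0 else (value - self.minimum) / span

    def denormalise(self, value):
        return value * (self.maximum - self.minimum) + self.minimum


def export_rows(rows, normalisers, normalised=True):
    """Return copies of the rows, normalising each feature that has a normaliser."""
    if not normalised:
        return [dict(row) for row in rows]
    return [
        {
            name: normalisers[name].normalise(value) if name in normalisers else value
            for name, value in row.items()
        }
        for row in rows
    ]


rows = [{"age": 20, "country": "PT"}, {"age": 40, "country": "ES"}, {"age": 60, "country": "PT"}]
normalisers = {"age": MinMaxNormaliser(minimum=20, maximum=60)}

normalised_rows = export_rows(rows, normalisers, normalised=True)
raw_rows = export_rows(rows, normalisers, normalised=False)
```

Storing the normaliser parameters with the Feature Set is what lets the same transformation be applied at training time and at prediction time.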
A hosted version of the Feature Store will be self-contained: for example, input data in S3 can land as training data in another S3 bucket, using only what the hosted infrastructure provides. We will charge based on AWS cost.
But it's possible that the clients want to use their own infra:
- S3 for privacy purposes
- A connection to an existing Spark cluster to save money
- Consume images and use content from image recognition as features
- Consume text and use sentiment analysis to extract features
https://algorithmia.com/enterprise
https://bigml.com/features
https://www.dataiku.com/dss/features/machine-learning/
https://hydrosphere.io
https://www.datarobot.com/
https://www.h2o.ai/h2o/
http://blog.richardweiss.org/2016/10/13/kaggle-with-luigi.html