@datajoely
Last active September 4, 2023 10:22
# Kedro data layers
| Layer | Order | Description |
| --- | --- | --- |
| raw | Sequential | The initial start of the pipeline, containing the sourced data model(s) that should never be changed; it forms your single source of truth to work from. These data models can be un-typed in most cases, e.g. CSV, but this will vary from case to case. Given the relative cost of storage today, painful experience suggests it is safer never to work with the original data directly. |
| intermediate | Sequential | This stage is optional if your data is already typed. A typed representation of the raw layer, e.g. converting string-based values into their correct typed representation as numbers, dates etc. Our recommended approach is to mirror the raw layer in a typed format like Apache Parquet. Avoid transforming the structure of the data, but simple operations like cleaning up field names or 'unioning' multi-part CSVs are permitted. |
| primary | Sequential | Domain-specific data model(s) containing cleansed, transformed and wrangled data from either the raw or intermediate layer. This forms the workspace for any feature engineering down the line, holding the data transformed into a model that fits the problem domain in question. If you are working with data that is already formatted for the problem domain, it is reasonable to skip to this point. |
| feature | Sequential | Analytics-specific data model(s) containing a set of features defined against the primary data, grouped by feature area of analysis and stored against a common dimension. In practice this covers the independent variables and target variable which form the basis for ML exploration and application. Since this framework was designed, MLOps tooling has progressed and 'Feature Stores' (such as Feast or SageMaker Feature Store) now provide a versioned, centralised storage location with low-latency serving. This separation still fits well within this conceptual framework. |
| model_input | Sequential | Analytics-specific data model(s) containing all feature data against a common dimension and, in the case of live projects, against an analytics run date to ensure that you track the historical changes of the features over time. Many places call these the 'Master Table(s)'; we believe this terminology is more precise and covers multi-model pipelines better. |
| models | Sequential | Stored, serialised pre-trained machine learning models. In the simplest case, these are stored as something like a pickle file on a filesystem. More mature implementations would leverage MLOps frameworks that provide model serving, such as MLflow. |
| model_output | Sequential | Analytics-specific data model(s) containing the results generated by the model based on the model input data. |
| reporting | Free-form | Used for outputting analyses or modelling results that are often ad hoc or simply descriptive reports. |
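As a sketch of how these layers look in practice, the entries below show a Kedro `catalog.yml` tagging datasets with a layer so Kedro-Viz can group them. The dataset names and file paths are illustrative only; the folder numbering follows the common Kedro project convention (`01_raw`, `02_intermediate`, ...). In recent Kedro versions the layer is declared under `metadata: kedro-viz:`; older versions used a top-level `layer:` key on each entry, so check the documentation for your version.

```yaml
# Illustrative catalog.yml fragment -- names and paths are examples, not prescriptions.

companies_raw:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv      # untouched source data, single source of truth
  metadata:
    kedro-viz:
      layer: raw

companies_typed:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/companies.parquet  # raw mirrored into a typed format
  metadata:
    kedro-viz:
      layer: intermediate

model_input_table:
  type: pandas.ParquetDataset
  filepath: data/05_model_input/model_input_table.parquet  # all features on a common dimension
  metadata:
    kedro-viz:
      layer: model_input

classifier:
  type: pickle.PickleDataset
  filepath: data/06_models/classifier.pkl  # serialised trained model
  metadata:
    kedro-viz:
      layer: models
```

Tagging every catalog entry this way costs little and lets Kedro-Viz render the pipeline as the layered flow described in the table above.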