@datajoely
Last active September 4, 2023 10:22
# Kedro data layers
| Layer | Order | Description |
| --- | --- | --- |
| raw | Sequential | The initial start of the pipeline, containing the sourced data model(s) that should never be changed; it forms your single source of truth to work from. These data models can be un-typed in most cases, e.g. CSV, but this will vary from case to case. Given the relative cost of storage today, painful experience suggests it is safer never to work with the original data directly. |
| intermediate | Sequential | This stage is optional if your data is already typed. A typed representation of the raw layer, e.g. converting string-based values into their correct typed representation as numbers, dates etc. Our recommended approach is to mirror the raw layer in a typed format like Apache Parquet. Avoid transforming the structure of the data, but simple operations like cleaning up field names or 'unioning' multi-part CSVs are permitted. |
| primary | Sequential | Domain-specific data model(s) containing cleansed, transformed and wrangled data from either the raw or intermediate layer. This forms the workspace for any feature engineering down the line, holding the data transformed into a model that fits the problem domain in question. If you are working with data that is already formatted for the problem domain, it is reasonable to skip to this point. |
| feature | Sequential | Analytics-specific data model(s) containing a set of features defined against the primary data, grouped by feature area of analysis and stored against a common dimension. In practice this covers the independent variables and target variable which form the basis for ML exploration and application. Since this framework was designed, MLOps tooling has progressed and 'Feature Stores' (such as Feast or SageMaker Feature Store) now provide a versioned, centralised storage location with low-latency serving. This separation still fits well within this conceptual framework. |
| model_input | Sequential | Analytics-specific data model(s) containing all feature data against a common dimension and, in the case of live projects, against an analytics run date to ensure that you track the historical changes of the features over time. Many places call these the 'Master Table(s)'; we believe this terminology is more precise and covers multi-model pipelines better. |
| models | Sequential | Stored, serialised pre-trained machine learning models. In the simplest case, these are stored as something like a pickle file on a filesystem. More mature implementations would leverage MLOps frameworks that provide model serving, such as MLflow. |
| model_output | Sequential | Analytics-specific data model(s) containing the results generated by the model based on the model input data. |
| reporting | Free-form | Used for outputting analyses or modelling results that are often ad hoc or simply descriptive reports. |
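As a sketch of how these layers look in practice, the entries below show a Kedro `catalog.yml` tagging datasets with a layer so Kedro-Viz can group them. The dataset names and file paths are illustrative only; the folder numbering follows the common Kedro project convention (`01_raw`, `02_intermediate`, ...). In recent Kedro versions the layer is declared under `metadata: kedro-viz:`; older versions used a top-level `layer:` key on each entry, so check the documentation for your version.

```yaml
# Illustrative catalog.yml fragment -- names and paths are examples, not prescriptions.

companies_raw:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv      # untouched source data, single source of truth
  metadata:
    kedro-viz:
      layer: raw

companies_typed:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/companies.parquet  # raw mirrored into a typed format
  metadata:
    kedro-viz:
      layer: intermediate

model_input_table:
  type: pandas.ParquetDataset
  filepath: data/05_model_input/model_input_table.parquet  # all features on a common dimension
  metadata:
    kedro-viz:
      layer: model_input

classifier:
  type: pickle.PickleDataset
  filepath: data/06_models/classifier.pkl  # serialised trained model
  metadata:
    kedro-viz:
      layer: models
```

Tagging every catalog entry this way costs little and lets Kedro-Viz render the pipeline as the layered flow described in the table above.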