datajoely / layers.md (last active September 4, 2023 10:22)
Kedro data layers
| Layer | Order | Description |
| --- | --- | --- |
| raw | Sequential | The initial start of the pipeline, containing the sourced data model(s) that should never be changed; it forms your single source of truth to work from. These data models can be un-typed in most cases, e.g. CSV, but this will vary from case to case. Given the relative cost of storage today, painful experience suggests it's safer never to work with the original data directly! |
| intermediate | Sequential | Optional if your data is already typed. A typed representation of the raw layer, e.g. converting string-based values into their correct typed representation as numbers, dates etc. Our recommended approach is to mirror the raw layer in a typed format like Apache Parquet. Avoid transforming the structure of the data, but simple operations like cleaning up field names or unioning multi-part CSVs are permitted (a sketch of such a node follows the table). |
| primary | Sequential | |
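
To make the raw-to-intermediate step concrete, here is a minimal sketch (not from the gist) of an intermediate-layer node that unions a multi-part raw CSV extract and applies typing only, so the result can be saved as Apache Parquet; the function and column names are illustrative assumptions.

```python
import pandas as pd


def int_equipment_extract(*raw_parts: pd.DataFrame) -> pd.DataFrame:
    """Union the multi-part raw extract and apply typing only.

    The structure of the data is deliberately left unchanged: the parts are
    concatenated, field names are tidied and columns are cast so the result
    can be written to the intermediate layer as Apache Parquet.
    """
    df = pd.concat(raw_parts, ignore_index=True)
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Illustrative casts; the real column names are assumptions
    df = df.astype({"equipment_id": "string"})
    df["reading_timestamp"] = pd.to_datetime(df["reading_timestamp"])
    return df
```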
| Kedro layer | Comment |
| --- | --- |
| raw | In this situation three data sources are described: an Excel file, a multi-part CSV export from a database and a single CSV export from a personnel management system. |
| intermediate | The intermediate layer is a typed mirror of the raw layer, with a minor transformation applied to the equipment extract: the multi-part data received has been concatenated into a single Parquet dataset. |
| primary | Two domain-level datasets have been constructed from the intermediate layer, which model equipment shutdowns and operator actions. |
| feature | Several features have been constructed from the primary layer which represent variables we think may be predictors of equipment shutdowns, such as the maintenance schedule and recent shutdowns. |
| model_input | Two model inputs have been created since we are experimenting with two modeling approaches: one time-series based and another equipment-centric without a temporal element. |
| models | The trained models have been serialised. |
from kedro.pipeline import Pipeline, node


def create_template_pipeline() -> Pipeline:
    """Template declared here with real inputs, but placeholder outputs and parameters."""
    return Pipeline(
        [
            node(
                func=create_model_inputs,  # defined elsewhere in the project
                inputs=[  # These inputs are never overridden
                    "feat_days_since_last_shutdown",
                    "feat_days_between_shutdown_last_maintenance",
                    "feat_fte_maintenance_hours_last_6m",
                ],
                # The gist preview is truncated here; a placeholder output name is assumed
                outputs="template.model_input",
            ),
        ]
    )
datajoely / dataset.py (created September 3, 2021 08:24)
Expiring HTTP dataset
"""
This module provides a custom Kedro dataset.
"""
import hashlib
import json
import logging
from pathlib import Path
from typing import Any, Dict, Optional, Union
from urllib.parse import urlparse
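
The gist preview ends at the imports above. As a rough sketch of what an expiring HTTP dataset could look like on top of them (the class name, cache layout, TTL handling and the `requests` dependency are assumptions, not the gist's actual implementation), one approach is an `AbstractDataSet` subclass that caches each response on disk under a hash of its URL and re-fetches once the cached copy is older than a configured expiry:

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Any, Dict

import requests  # assumed dependency; not shown in the gist preview

from kedro.io import AbstractDataSet  # Kedro <0.19 spelling of the base class


class ExpiringHTTPDataSet(AbstractDataSet):
    """Load JSON from a URL, caching the response on disk for `expiry_seconds`."""

    def __init__(self, url: str, cache_dir: str = ".http_cache", expiry_seconds: int = 3600):
        self._url = url
        self._expiry_seconds = expiry_seconds
        self._cache_dir = Path(cache_dir)
        self._cache_dir.mkdir(parents=True, exist_ok=True)
        # Key the cache file on a hash of the URL
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        self._cache_file = self._cache_dir / f"{digest}.json"

    def _cache_is_fresh(self) -> bool:
        if not self._cache_file.exists():
            return False
        age = time.time() - self._cache_file.stat().st_mtime
        return age < self._expiry_seconds

    def _load(self) -> Any:
        if self._cache_is_fresh():
            return json.loads(self._cache_file.read_text())
        response = requests.get(self._url, timeout=30)
        response.raise_for_status()
        data = response.json()
        self._cache_file.write_text(json.dumps(data))
        return data

    def _save(self, data: Any) -> None:
        raise NotImplementedError("This dataset is read-only")

    def _describe(self) -> Dict[str, Any]:
        return {"url": self._url, "expiry_seconds": self._expiry_seconds}
```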