GitHub Gists by Dave Ruijter (DaveRuijter)
@DaveRuijter
DaveRuijter / pipeline-backup-weekly.yml
Created October 21, 2021 20:30
This YAML is part of the Data Lake Backup Strategy
parameters:
  - name: backupStore
    displayName: 'Backup 05 store'
    type: boolean
    default: true
  - name: backupBronze
    displayName: 'Backup 10 bronze'
    type: boolean
    default: true
  - name: backupSilver
@DaveRuijter
DaveRuijter / pipeline-backup-daily.yml
Created October 21, 2021 20:32
This YAML pipeline is part of the Data Lake Backup Strategy
parameters:
  - name: backupStore
    displayName: 'Backup 05 store'
    type: boolean
    default: true
  - name: backupBronze
    displayName: 'Backup 10 bronze'
    type: boolean
    default: true
  - name: backupSilver
@DaveRuijter
DaveRuijter / is_pipeline_running.json
Created December 12, 2021 11:39
ADF/ASA pipeline to verify whether a pipeline is running / in progress
{
    "name": "00_is_pipeline_running",
    "properties": {
        "activities": [
            {
                "name": "Get Pipeline Runs",
                "type": "WebActivity",
                "dependsOn": [
                    {
                        "activity": "getSubscriptionID",
@DaveRuijter
DaveRuijter / multicolumn_expression_evaluation.py
Created January 2, 2022 09:04
Custom multi-column SQL expression evaluation expectation for the Great Expectations framework
from great_expectations.expectations.expectation import MulticolumnMapExpectation
from great_expectations.expectations.util import render_evaluation_parameter_string
from great_expectations.render.util import (
num_to_str,
substitute_none_for_missing,
parse_row_condition_string_pandas_engine,
)
from scipy import stats as stats
from great_expectations.execution_engine import (
PandasExecutionEngine,
@DaveRuijter
DaveRuijter / generate_hash.py
Created April 3, 2022 07:13
A couple of functions to easily create an integer-based hash. Use it for the key column of a dimension.
import hashlib

spark.udf.register("udf_removehtmltagsfromstring", udf_removehtmltagsfromstring, "string")

# This is the central hashing function, used by other functions. It uses the blake2b hashing algorithm. With a central function, we can adjust the hashing when needed.
def udf_centralhash(string: str) -> int:
    val = hashlib.blake2b(
        digest_size=6
    )  # Increase digest_size to make the hash bigger. 6 seems a good start for our use for dimensions.
    val.update(string.encode("utf-8"))  # feed the input string as utf-8 to the blake2b object
    intval = int(val.hexdigest(), 16)  # and convert it to an integer
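The gist is cut off after the integer conversion. A self-contained sketch of the same idea (plain Python, without the Spark UDF registration; the final return statement is an assumption, since the original is truncated):

```python
import hashlib

def centralhash(string: str) -> int:
    """Hash a string to an integer using blake2b with a 6-byte digest."""
    h = hashlib.blake2b(digest_size=6)  # 6 bytes -> integers below 2**48
    h.update(string.encode("utf-8"))
    return int(h.hexdigest(), 16)  # hex digest interpreted as an integer

# Deterministic: the same business key always yields the same surrogate key,
# so the dimension key is stable across reloads.
key = centralhash("Customer|12345")
```

Because the digest is only 6 bytes, the value fits comfortably in a BIGINT column, at the cost of a (small) collision probability that grows with the number of distinct keys.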