Dave Ruijter (DaveRuijter)

@DaveRuijter
DaveRuijter / generate_hash.py
Created April 3, 2022 07:13
A couple of functions to easily create an integer-based hash. Use it for the key column of a dimension.
spark.udf.register("udf_removehtmltagsfromstring", udf_removehtmltagsfromstring, "string")
import hashlib

# This is the central hashing function, used by the other functions. It uses the blake2b hashing algorithm. With a central function, we can adjust the hashing when needed.
def udf_centralhash(string: str) -> int:
    val = hashlib.blake2b(
        digest_size=6
    )  # Increase the digest size to make the hash bigger. 6 seems a good start for our use for dimensions.
    val.update(string.encode("utf-8"))  # feed the input string as utf-8 to the blake2b object
    intval = int(val.hexdigest(), 16)  # and convert the hex digest to an integer
    return intval
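
The gist preview is truncated here. As a hedged usage sketch (the UDF name, table, and column names below are assumptions, not taken from the gist), the function could be registered as a Spark SQL UDF and used to derive an integer surrogate key for a dimension:

# Hedged sketch: register the hashing function as a Spark SQL UDF and use it for a dimension key column.
# Assumes an active SparkSession named `spark` (pre-defined in Databricks / Synapse notebooks).
spark.udf.register("udf_centralhash", udf_centralhash, "long")

dim_customer = spark.sql(
    """
    SELECT udf_centralhash(CustomerBusinessKey) AS CustomerKey,
           *
    FROM customers
    """
)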
@DaveRuijter
DaveRuijter / multicolumn_expression_evaluation.py
Created January 2, 2022 09:04
Custom multi-column SQL expression evaluation expectation for the Great Expectations framework
from great_expectations.expectations.expectation import MulticolumnMapExpectation
from great_expectations.expectations.util import render_evaluation_parameter_string
from great_expectations.render.util import (
    num_to_str,
    substitute_none_for_missing,
    parse_row_condition_string_pandas_engine,
)
from scipy import stats as stats
from great_expectations.execution_engine import (
    PandasExecutionEngine,
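
The preview above only shows the imports. For orientation, here is a hedged sketch of how a multicolumn expectation is invoked on a Great Expectations Pandas dataset, using the built-in expect_multicolumn_sum_to_equal as a stand-in; the custom expectation defined in this gist would be called in a similar way once registered.

# Hedged illustration with a built-in multicolumn expectation; the custom
# expectation from this gist would expose a similar method once registered.
import pandas as pd
import great_expectations as ge

df = ge.from_pandas(pd.DataFrame({"a": [1, 4], "b": [9, 6]}))
result = df.expect_multicolumn_sum_to_equal(column_list=["a", "b"], sum_total=10)
print(result.success)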
@DaveRuijter
DaveRuijter / is_pipeline_running.json
Created December 12, 2021 11:39
ADF/ASA pipeline to verify if a pipeline is running / in progress
{
    "name": "00_is_pipeline_running",
    "properties": {
        "activities": [
            {
                "name": "Get Pipeline Runs",
                "type": "WebActivity",
                "dependsOn": [
                    {
                        "activity": "getSubscriptionID",
@DaveRuijter
DaveRuijter / pipeline-backup-daily.yml
Created October 21, 2021 20:32
This YAML pipeline is part of the Data Lake Backup Strategy
parameters:
  - name: backupStore
    displayName: 'Backup 05 store'
    type: boolean
    default: true
  - name: backupBronze
    displayName: 'Backup 10 bronze'
    type: boolean
    default: true
  - name: backupSilver
@DaveRuijter
DaveRuijter / pipeline-backup-weekly.yml
Created October 21, 2021 20:30
This YAML is part of the Data Lake Backup Strategy
parameters:
  - name: backupStore
    displayName: 'Backup 05 store'
    type: boolean
    default: true
  - name: backupBronze
    displayName: 'Backup 10 bronze'
    type: boolean
    default: true
  - name: backupSilver
@DaveRuijter
DaveRuijter / stage-backup-dls.yml
Created October 21, 2021 20:28
This YAML file is part of the Data Lake Backup Strategy
parameters:
  - name: dependsOnStage
    type: string
  - name: triggerPeriod
    type: string
  - name: environment
    type: string
  - name: backupStore
    type: boolean
  - name: backupBronze
@DaveRuijter
DaveRuijter / job-backup-dls.yml
Last active November 23, 2021 19:32
This YAML file is part of the Backup Strategy
parameters:
  - name: backups
    displayName: 'Array of backups'
    type: object
    default: []
  - name: serviceConnectionName
    displayName: 'Name of the DevOps Service Connection'
    type: string
  - name: execute
    displayName: 'Execute this Job'
@DaveRuijter
DaveRuijter / backup-dls.ps1
Created October 21, 2021 20:16
This PowerShell script performs a copy between two storage accounts using AzCopy.
param(
    [String]$sourceStorageAccount,
    [String]$targetStorageAccount,
    [String]$sourceFolder,
    [String]$targetFolder,
    [String]$sourceSasToken,
    [String]$targetSasToken,
    [String]$triggerPeriod,
    [Int32]$azCopyConcurrency
)
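
The preview shows only the parameter block. As a hedged sketch of what such a copy boils down to, azcopy copy takes the source and target URLs with their SAS tokens appended, and the AZCOPY_CONCURRENCY_VALUE environment variable caps concurrency. The Python wrapper below is illustrative only (account, container, and folder names are placeholders); the gist itself does this in PowerShell.

# Hedged sketch: invoke AzCopy to copy a folder between two Data Lake storage accounts.
# Account, container, and folder names and the SAS tokens are placeholders.
import os
import subprocess

source_url = "https://<source-account>.dfs.core.windows.net/dls/<source-folder>?<source-sas-token>"
target_url = "https://<target-account>.dfs.core.windows.net/dls/<target-folder>?<target-sas-token>"

env = os.environ.copy()
env["AZCOPY_CONCURRENCY_VALUE"] = "32"  # corresponds to the $azCopyConcurrency parameter

subprocess.run(
    ["azcopy", "copy", source_url, target_url, "--recursive", "--overwrite=ifSourceNewer"],
    check=True,
    env=env,
)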
@DaveRuijter
DaveRuijter / data_lake_sta_lifecycle_policy_rules.json
Created October 21, 2021 19:28
This policy is used on the storage account of the Data Lake. It ensures that new data in the dls/05_store/_archive folder of the lake is automatically moved to the cool access tier.
{
    "rules": [
        {
            "enabled": true,
            "name": "daily-moving-data-lake-store-archive-to-cool",
            "type": "Lifecycle",
            "definition": {
                "actions": {
                    "baseBlob": {
                        "tierToCool": {
@DaveRuijter
DaveRuijter / backup_sta_lifecycle_policy_rules.json
Created October 21, 2021 19:27
This policy is used on the storage account that contains the backup copies of the Data Lake. It applies a retention of 60 days to the weekly backups and a retention of 30 days to the daily (incremental) backups.
{
    "rules": [
        {
            "enabled": true,
            "name": "weeklybackupsrule",
            "type": "Lifecycle",
            "definition": {
                "actions": {
                    "baseBlob": {
                        "delete": {