Introducing: lakeFS hooks

TL;DR

The most common use case for lakeFS is CI and CD for data.

Continuous integration of data is the process of exposing data to consumers only after ensuring it adheres to best practices such as format, schema, and PII governance. Continuous deployment of data ensures the quality of data at each step of a production pipeline.

In this post, I will present lakeFS webhooks and showcase a few implementations of pre-commit and pre-merge hooks that enable CI/CD for data.

What are lakeFS hooks?

lakeFS hooks allow you to automate a given set of checks and validations, and to ensure they run before important life-cycle events.

They are conceptually similar to Git hooks, but unlike Git's, they run remotely on a server - so they are guaranteed to run when the appropriate event is triggered.

Currently, lakeFS allows executing hooks when two types of events occur: pre-commit events, which run before a commit is acknowledged, and pre-merge events, which trigger right before a merge operation. For both event types, returning an error will cause lakeFS to block the operation - and return that failure to the requesting user.

This is an extremely powerful guarantee - we can now codify and automate the rules and practices that all data lake participants have to adhere to.

This guarantee is then made available to data consumers: you're reading from production/important/collection/? Great - you're guaranteed never to see breaking schema changes, and all data must have passed a known set of statistical quality checks. If it's on the main branch, it's safe.

How are lakeFS hooks configured?

Currently, one type of hook is supported: the webhook. Webhooks are really powerful because they allow users to bring their own code, libraries, and environment settings - decoupling the actual logic of executing the checks from lakeFS. Additionally, deploying stateless web services using Lambda (or other serverless equivalents) or Kubernetes is relatively simple from an operational standpoint.

Deploying a lakeFS webhook requires two steps:

  1. Set up a web server that can accept a POST request. This server will receive an HTTP POST request with a JSON body describing the event that occurred (complete with branch and repository information). It must return an HTTP 2xx response for the commit or merge operation to continue, or any other status code to fail it (see the payload sketch after this list).
  2. Add a YAML file inside your lakeFS repository, under _lakefs_hooks/, with a unique name. This file describes which event types trigger the action, and lists one or more webhooks to invoke when the event occurs.
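
For illustration, the JSON body might look roughly like this. The repository, source_ref, and branch_id field names are taken from the example webhook code below; the event_type field and all of the values are assumptions for this sketch, not the authoritative schema:

{
  "event_type": "pre-merge",
  "repository": "example-repo",
  "source_ref": "feature-branch",
  "branch_id": "main"
}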

Example webhook (using Python & Flask):

import lakefs
from flask import Flask, request, jsonify

app = Flask(__name__)

# lakeFS delivers the event as an HTTP POST request
@app.route('/webhooks/csv', methods=['POST'])
def csv_webhook():
    client = lakefs.Client(...)  # configure your lakeFS endpoint and credentials here
    event = request.get_json()

    # Diff the source ref being merged against the destination branch
    for change in client.diff(event['repository'], event['source_ref'], event['branch_id']):
        if change.type == 'added' and change.path.endswith('.csv'):
            # Someone added a .csv file, this is forbidden
            #   and we should fail the merge operation
            return jsonify({'error': 'CSV files not allowed!'}), 400

    # No CSV files found, these changes are valid!
    return '', 200
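
To try it out locally, a minimal sketch (the host and port are placeholders you would match to the url in your hook definition):

if __name__ == '__main__':
    # The Flask development server is fine for testing; use a production
    # WSGI server (e.g. gunicorn) for real deployments.
    app.run(host='0.0.0.0', port=8080)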

Of course, in reality, the logic could be arbitrarily more complex - running actual computations on the changed data, talking to other external APIs, or performing your own specific custom business logic.

Example YAML definition file to set it up as a pre-merge hook (placed under _lakefs_hooks/no_csv.yaml):

---
name: NoCSVInMainBranch
description: This hook prevents CSV files from being merged to production branches
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: csv_diff_webhook
    type: webhook
    description: Ensure no new objects end with .csv
    properties:
      url: "http://<host:port>/webhooks/csv"

And that's it. Once the hook is deployed, we're guaranteed that no CSV files will ever appear in the main branch.

A few useful examples

Webhooks allow users to add any custom behavior or validation they wish. A few useful examples:

  • Metadata validation: on merge, ensure new tables or partitions are registered in a metastore
  • Reproducibility guarantee: all writes to production/tables/ must also add commit metadata fields describing the Git commit hash of the job that produced them
  • Schema enforcement: allow only backward-compatible changes to production table schemas, and disallow column names that expose personally identifiable information from being written to paths that shouldn't contain them (see the sketch after this list)
  • Data quality checks: ensure the data itself passes a set of statistical tests, to maintain quality and avoid issues downstream
  • Format enforcement: ensure the organization standardizes on columnar formats for analytics
  • Partial or corrupted data validation: make sure partitions are complete and contain no duplicates before merging them to production
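
To make the schema enforcement example concrete, here's a minimal sketch of its core check. The function name and column denylist are hypothetical; a real webhook would first extract the column names from each changed file (e.g. by reading the Parquet footer) before calling it:

# Hypothetical denylist of column names that expose PII
FORBIDDEN_COLUMNS = {'user_id', 'email', 'phone_number'}

def validate_columns(column_names):
    # Return an error message if any forbidden column is present, None otherwise
    forbidden = FORBIDDEN_COLUMNS & set(column_names)
    if forbidden:
        return 'forbidden columns found: ' + ', '.join(sorted(forbidden))
    return None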

Clone and start using it now

We're also open sourcing a set of reusable lakeFS hooks that provide some of the common functionality described above.

The lakeFS-hooks repository on GitHub includes four webhooks that provide the following validations:

  1. Pre-merge hook for format validation (Example: ensure only Delta Lake, Parquet, and ORC files exist under production/)
  2. Pre-merge hook that does simple schema validation (Example: block files containing a user_id column from being written under a certain prefix)
  3. Pre-commit hook that checks for immutability violations - ensures partitions are either written as a whole, or replaced as a whole
  4. Pre-commit hook that validates the existence of commit metadata when writing to a given prefix (Example: if a commit writes to a production/ path, it should include owning_team, job_git_commit_hash, and airflow_dag_url metadata fields; a sketch of this check follows the list)
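
As an illustration of check number 4, here's a minimal sketch of the metadata validation. Note that the commit_metadata field name on the event payload is an assumption for this sketch, not a documented part of the webhook schema:

REQUIRED_METADATA = {'owning_team', 'job_git_commit_hash', 'airflow_dag_url'}

def validate_commit_metadata(event):
    # 'commit_metadata' is an assumed field name on the hook event payload
    metadata = event.get('commit_metadata') or {}
    missing = REQUIRED_METADATA - set(metadata)
    if missing:
        return 'missing commit metadata fields: ' + ', '.join(sorted(missing))
    return None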

For more information on how to use these hooks and how to extend them to support your own requirements, please visit the lakeFS-hooks repository on GitHub.


To get started with lakeFS, check out the official documentation and the GitHub Repository.
