
@clrcrl
Last active November 27, 2023 16:39
Using dbt to version-control macros

This is the process I used to put Redshift user-defined functions (UDFs) into dbt.

  1. Created a subdirectory: macros/udfs/
  2. Created a file for each udf, e.g. macros/udfs/f_future_date.sql.
```sql
{% macro f_future_date() %}
CREATE OR REPLACE FUNCTION {{target.schema}}.f_future_date()
RETURNS TIMESTAMP
IMMUTABLE AS $$
SELECT '2100-01-01'::TIMESTAMP;
$$ LANGUAGE sql
{% endmacro %}
```
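For reference, dbt substitutes the target schema when it renders this macro; with a (hypothetical) target schema of `dbt_claire`, the compiled statement would look like:

```sql
CREATE OR REPLACE FUNCTION dbt_claire.f_future_date()
RETURNS TIMESTAMP
IMMUTABLE AS $$
SELECT '2100-01-01'::TIMESTAMP;
$$ LANGUAGE sql
```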
  3. Created a macro called macros/create_udfs.sql which calls each UDF macro. Note that the statements are separated with semicolons.
```sql
{% macro create_udfs() %}

{{f_list_custom_keys()}};

{{f_count_custom_keys()}};

{{f_future_date()}}

{% endmacro %}
```
  4. Added an on-run-start hook to my project:
```yml
on-run-start:
    - '{{create_udfs()}}'
```
  5. Updated the references to UDFs in my models to use the schema-qualified versions, e.g.
```sql
SELECT
{{target.schema}}.F_FUTURE_DATE()
```
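As a fuller sketch of step 5 (the model name, columns, and `subscriptions` ref are all hypothetical), a model might use the UDF to backfill open-ended records:

```sql
-- models/current_subscriptions.sql (hypothetical)
SELECT
    subscription_id,
    started_at,
    -- treat open subscriptions as ending in the far future
    COALESCE(ended_at, {{ target.schema }}.f_future_date()) AS ended_at
FROM {{ ref('subscriptions') }}
```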

The “gotcha” parts were:

  1. I had to grant permission to users to create Python UDFs:
```sql
-- as your superuser
GRANT USAGE ON LANGUAGE PLPYTHONU TO claire;
```
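For context, that grant is needed because Redshift Python UDFs are written in the plpythonu language. A minimal sketch of such a UDF (the function name and logic are hypothetical):

```sql
-- requires USAGE ON LANGUAGE PLPYTHONU for the creating user
CREATE OR REPLACE FUNCTION {{ target.schema }}.f_is_weekend(d DATE)
RETURNS BOOLEAN
IMMUTABLE AS $$
    # Redshift passes the DATE argument in as a Python datetime.date
    return d is not None and d.weekday() >= 5
$$ LANGUAGE plpythonu
```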
  2. I had to use schemas because users cannot edit each other's UDFs (i.e. if claire creates the UDF, and sinter runs the CREATE OR REPLACE FUNCTION statement, you'll get a "must be owner of function" error). Using schemas means you are creating distinct UDFs,* so you won't hit this issue. I'd recommend using schemas anyway, to maintain separate dev and production UDFs.

* Assuming each Redshift user profile used by dbt has a distinct target schema.
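For example, dev and production targets with distinct schemas might be set up like this in profiles.yml (the profile name and schema values are hypothetical):

```yml
my-redshift-profile:
  target: dev
  outputs:
    dev:
      type: redshift
      schema: dbt_claire   # claire's dev copy of each UDF lives here
      # host, user, dbname, etc. omitted
    prod:
      type: redshift
      schema: analytics    # production UDFs live here
      # host, user, dbname, etc. omitted
```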

ramcdfe commented Nov 27, 2023

Hi Claire!

I wonder if you have tried creating a UDF that queries another available source in your project using an input parameter. For example, I am trying to adapt a SQL Server UDF that reads a table 'Projects' from my raw database and returns a Boolean result based on the input variable 'id'. The function is the following:

```sql
CREATE FUNCTION fn_ProjectIsAvailable
(
    @Id BIGINT
)
RETURNS BIT
AS
BEGIN

RETURN ISNULL((SELECT TOP 1 1
            FROM Projects p WHERE p.Id = @Id
            ), 0)
END
```

I have tried adapting this UDF following your post. I am using dbt with Spark SQL on Databricks. I created the following macros:

my_project/macros/udfs/fn_ProjectIsAvailable.sql
my_project/macros/create_udfs.sql

```sql
{% macro create_fn_ProjectIsAvailable() %}

CREATE OR REPLACE FUNCTION {{ target.schema }}.fn_ProjectIsAvailable(Id BIGINT)
RETURNS BIT
AS

COALESCE
(
    (
        SELECT 1
        FROM {{ source('raw_db', 'projects') }} p
        WHERE
            p.Id = {{ Id }}
        LIMIT 1
    ), 0
)

{% endmacro %}
```

where:
{{ source('raw_db', 'projects') }} is the same Projects table from the SQL Server database that I already ingested into a Databricks schema 'raw_db' using dbt. It is basically a transactional catalogue of projects with their ids.

and

```sql
{% macro create_udfs() %}

{{create_fn_ProjectIsAvailable()}};

{% endmacro %}
```

I also added this entry to my dbt_project.yml file:

```yml
on-run-start:
  - '{{ create_udfs() }}'
```

Now, I would like to use this UDF/macro in another DBT model where I intend to use this UDF in this way:

my_project/models/dwh_ProjectCurrentStatus.sql

```sql
SELECT {{target.schema}}.fn_ProjectIsAvailable(Id) AS IsProjectCurrentlyAvailable
FROM {{ source('raw_db', 'projects') }}
```

Unfortunately, when I compile the preview of this model, I realised that the 'Id' variable is not being passed into this query:

```sql
SELECT
COALESCE
(
    (
        SELECT 1
        FROM `raw_db`.`projects` p
        WHERE
            p.Id =
        LIMIT 1
    ), 0
) AS IsProjectAvailable
FROM `raw_db`.`projects`
```

Please notice how `p.Id =` receives a blank.

I don't really know how to solve this and I would really appreciate any feedback. Many thanks.
