Skip to content

Instantly share code, notes, and snippets.

@stephenleo
Last active April 21, 2023 01:52
Show Gist options
  • Save stephenleo/7df2558db7211803328e052747f233a2 to your computer and use it in GitHub Desktop.
Save stephenleo/7df2558db7211803328e052747f233a2 to your computer and use it in GitHub Desktop.
DataEngineering

Big Query and SQL Hacks

  1. Use QUALIFY to filter rows after a window function. It's way more computationally efficient than using a CTE.

    SELECT RANK() OVER (PARTITION BY product_id ORDER BY rating DESC) AS rank 
    FROM table
    QUALIFY rank = 1
    
  2. ARRAY_AGG() doesn't preserve row order, so you need to specify the ordering within the function:

    ARRAY_AGG(
        STRUCT(rev.createdAt, rev.rating, rev.title, rev.content) 
        ORDER BY rev.createdAt DESC
        ) AS PRODUCT_REVIEWS
    
  3. BigQuery must NEVER be used in a for loop kind of query where you run the same query multiple times with different parameters. It's very expensive. Instead, try to combine multiple queries into one such as using the IN clause in a WHERE clause.

    -- DO NOT DO THIS
    SELECT * FROM table WHERE id = 1
    
    UNION ALL
    
    SELECT * FROM table WHERE id = 2
    
    -- DO THIS INSTEAD
    SELECT * FROM table WHERE id IN (1, 2)
    
  4. BQ Table Value Functions (TVF) get executed whenever each row is created. This is troublesome if you're using TVF's to calculate something like ingestion time and have to recover a table using time travel. You're ingestion time for all the rows in the time travel recovered table will be the time that you recovered the table at, not the time that the row was originally ingested.

Important Data Engineering concepts

  1. Idenpotency: The same operation applied multiple times should have the same effect as applying it once. This ensures you can safely rerun pipeline without worry about the result changing.
  2. Reverse ETL: Extracting data from a data warehouse and loading it into a source system like SFMC or SAP. Popular choices are: Census, Hightouch and Rudderstack.
  3. Semantic Layer: A layer of abstraction between the data warehouse and the data mart that allows for consistent definitions of business KPIs in yaml files. Popular choices are: DBT Metrics and Cube.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment