-
Use
QUALIFY
to filter rows after a window function. It's way more computationally efficient than using a CTE.SELECT RANK() OVER (PARTITION BY product_id ORDER BY rating DESC) AS rank FROM table QUALIFY rank = 1
-
ARRAY_AGG()
doesn't preserve row order, so you need to specify the ordering within the function:ARRAY_AGG( STRUCT(rev.createdAt, rev.rating, rev.title, rev.content) ORDER BY rev.createdAt DESC ) AS PRODUCT_REVIEWS
-
BigQuery must NEVER be used in a
for
loop kind of query where you run the same query multiple times with different parameters. It's very expensive. Instead, try to combine multiple queries into one such as using theIN
clause in aWHERE
clause.-- DO NOT DO THIS SELECT * FROM table WHERE id = 1 UNION ALL SELECT * FROM table WHERE id = 2 -- DO THIS INSTEAD SELECT * FROM table WHERE id IN (1, 2)
-
BQ Table Value Functions (TVF) get executed whenever each row is created. This is troublesome if you're using TVF's to calculate something like ingestion time and have to recover a table using time travel. You're ingestion time for all the rows in the time travel recovered table will be the time that you recovered the table at, not the time that the row was originally ingested.
Last active
April 21, 2023 01:52
-
-
Save stephenleo/7df2558db7211803328e052747f233a2 to your computer and use it in GitHub Desktop.
DataEngineering
- Idenpotency: The same operation applied multiple times should have the same effect as applying it once. This ensures you can safely rerun pipeline without worry about the result changing.
- Reverse ETL: Extracting data from a data warehouse and loading it into a source system like SFMC or SAP. Popular choices are: Census, Hightouch and Rudderstack.
- Semantic Layer: A layer of abstraction between the data warehouse and the data mart that allows for consistent definitions of business KPIs in
yaml
files. Popular choices are: DBT Metrics and Cube.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment