Last active
October 10, 2023 02:50
-
-
Save isha-dremio/7e9071f4a8b7b1c72499177af85d2e1d to your computer and use it in GitHub Desktop.
Build an Iceberg Lakehouse Workshop
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
1. Create a folder | |
CREATE FOLDER "catalog_name"."my_folder"; | |
2. Run SQL to create an Iceberg table | |
CREATE TABLE "catalog_name"."my_folder"."nyc_trips" ( | |
pickup_datetime TIMESTAMP, | |
passenger_count INT, | |
trip_distance_mi FLOAT, | |
fare_amount FLOAT, | |
tip_amount FLOAT, | |
total_amount FLOAT | |
); | |
3. Run SQL to migrate data to the Iceberg table. | |
COPY INTO "catalog_name"."my_folder"."nyc_trips" | |
FROM '@Samples/samples.dremio.com/' FILES('NYC-taxi-trips.csv'); | |
4. Verify COPY INTO | |
SELECT * FROM "catalog_name"."my_folder"."nyc_trips"; | |
5. Create a branch | |
CREATE BRANCH nyc_etl | |
AT BRANCH main | |
IN "catalog_name"; | |
6. Update the table | |
USE BRANCH nyc_etl | |
IN "catalog_name"; | |
UPDATE nyc_trips | |
SET passenger_count = 6 | |
WHERE passenger_count = 7; | |
7. Verify the update didn't impact the main branch | |
SELECT * | |
FROM "catalog_name".my_folder.nyc_trips AT BRANCH main | |
WHERE passenger_count = 7; | |
8. Merge the update into the main branch | |
MERGE BRANCH nyc_etl | |
INTO main | |
IN "catalog_name"; | |
9. Verify the changes in main | |
SELECT * | |
FROM "catalog_name".my_folder.nyc_trips AT BRANCH main | |
WHERE passenger_count = 7; | |
10. Create an aggregate reflection | |
ALTER TABLE "catalog_name"."my_folder"."nyc_trips" AT BRANCH main | |
CREATE AGGREGATE REFLECTION "taxi_reflection" | |
USING DIMENSIONS ("pickup_datetime") | |
MEASURES (passenger_count, trip_distance_mi, fare_amount, tip_amount, total_amount); | |
11. Create a view | |
CREATE VIEW "catalog_name"."my_folder"."nyc_trips_view" | |
AS SELECT * | |
FROM "catalog_name"."my_folder"."nyc_trips" | |
AT BRANCH main; | |
12. Run a query with the acceleration | |
SELECT DATE_PART('DAY', "pickup_datetime") AS "pickup_day", | |
SUM("passenger_count") AS "passenger_count" | |
FROM "catalog_name"."my_folder"."nyc_trips_view" AT BRANCH main | |
GROUP BY pickup_day | |
ORDER BY pickup_day; | |
13. Add a wiki to your view | |
This view contains taxi trips data from NYC. It can be used to determine the taxi trends such as frequency of trips, distance traveled, and average fares or tips. | |
Questions? Contact gnarly@dremio.com | |
### Columns | |
`pickup_datetime`: Time at which the passenger was picked up. | |
`passenger_count`: The number of passengers that were picked up for the taxi ride. | |
`trip_distance_mi`: The total distance of the trip from pickup to drop off locations in miles. | |
* Max: 351.0 | |
`fare_amount`: The dollar amount that the ride cost. | |
* Max: 158995.81 | |
`tip_amount`: The dollar amount that the customer tipped in addition to the fare_amount. | |
* Max: 888.19 | |
`total_amount`: The total dollar amount that the customer paid (fare + tip). | |
### Examples | |
`SELECT DATE_PART('MONTH', "nyc-trips"."pickup_datetime") AS "pickup_dateime", | |
SUM("nyc-trips"."passenger_count") AS "passenger_count" | |
FROM "catalog_name"."my_folder"."nyc-trips" AT BRANCH main | |
GROUP BY DATE_PART('MONTH', "nyc-trips"."pickup_datetime");` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment