Skip to content

Instantly share code, notes, and snippets.

@isha-dremio
Last active October 10, 2023 02:50
Show Gist options
  • Save isha-dremio/7e9071f4a8b7b1c72499177af85d2e1d to your computer and use it in GitHub Desktop.
Save isha-dremio/7e9071f4a8b7b1c72499177af85d2e1d to your computer and use it in GitHub Desktop.
Build an Iceberg Lakehouse Workshop
1. Create a folder
CREATE FOLDER "catalog_name"."my_folder";
2. Run SQL to create an Iceberg table
CREATE TABLE "catalog_name"."my_folder"."nyc_trips" (
pickup_datetime TIMESTAMP,
passenger_count INT,
trip_distance_mi FLOAT,
fare_amount FLOAT,
tip_amount FLOAT,
total_amount FLOAT
);
3. Run SQL to migrate data to the Iceberg table.
COPY INTO "catalog_name"."my_folder"."nyc_trips"
FROM '@Samples/samples.dremio.com/' FILES('NYC-taxi-trips.csv');
4. Verify COPY INTO
SELECT * FROM "catalog_name"."my_folder"."nyc_trips";
5. Create a branch
CREATE BRANCH nyc_etl
AT BRANCH main
IN "catalog_name";
6. Update the table
USE BRANCH nyc_etl
IN "catalog_name";
UPDATE nyc_trips
SET passenger_count = 6
WHERE passenger_count = 7;
7. Verify the update didn't impact the main branch
SELECT *
FROM "catalog_name".my_folder.nyc_trips AT BRANCH main
WHERE passenger_count = 7;
8. Merge the update into the main branch
MERGE BRANCH nyc_etl
INTO main
IN "catalog_name";
9. Verify the changes in main
SELECT *
FROM "catalog_name".my_folder.nyc_trips AT BRANCH main
WHERE passenger_count = 7;
10. Create an aggregate reflection
ALTER TABLE "catalog_name"."my_folder"."nyc_trips" AT BRANCH main
CREATE AGGREGATE REFLECTION "taxi_reflection"
USING DIMENSIONS ("pickup_datetime")
MEASURES (passenger_count, trip_distance_mi, fare_amount, tip_amount, total_amount);
11. Create a view
CREATE VIEW "catalog_name"."my_folder"."nyc_trips_view"
AS SELECT *
FROM "catalog_name"."my_folder"."nyc_trips"
AT BRANCH main;
12. Run a query with the acceleration
SELECT DATE_PART('DAY', "pickup_datetime") AS "pickup_day",
SUM("passenger_count") AS "passenger_count"
FROM "catalog_name"."my_folder"."nyc_trips_view" AT BRANCH main
GROUP BY pickup_day
ORDER BY pickup_day;
13. Add a wiki to your view
This view contains taxi trips data from NYC. It can be used to determine the taxi trends such as frequency of trips, distance traveled, and average fares or tips.
Questions? Contact gnarly@dremio.com
### Columns
`pickup_datetime`: Time at which the passenger was picked up.
`passenger_count`: The number of passengers that were picked up for the taxi ride.
`trip_distance_mi`: The total distance of the trip from pickup to drop off locations in miles.
* Max: 351.0
`fare_amount`: The dollar amount that the ride cost.
* Max: 158995.81
`tip_amount`: The dollar amount that the customer tipped in addition to the fare_amount.
* Max: 888.19
`total_amount`: The total dollar amount that the customer paid (fare + tip).
### Examples
`SELECT DATE_PART('MONTH', "nyc-trips"."pickup_datetime") AS "pickup_dateime",
SUM("nyc-trips"."passenger_count") AS "passenger_count"
FROM "catalog_name"."my_folder"."nyc-trips" AT BRANCH main
GROUP BY DATE_PART('MONTH', "nyc-trips"."pickup_datetime");`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment