Skip to content

Instantly share code, notes, and snippets.

@jwkidd3
Created September 23, 2022 13:53
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jwkidd3/8e511c7745776eafe11582035eec9ef7 to your computer and use it in GitHub Desktop.
Save jwkidd3/8e511c7745776eafe11582035eec9ef7 to your computer and use it in GitHub Desktop.
* All data can be found at s3://databrickssql/
* Create the following databases:
* {UNIQUE_IDENTIFIER}_retail
* {UNIQUE_IDENTIFIER}_nyse
* Number of orders placed every month. We have to consider only COMPLETE as well as CLOSED orders. Make sure output is sorted by month.
* Database: retail
* Table: orders
* Number of orders placed in the months of 2013 August, 2013 September and 2013 October. We have to consider all PENDING orders (PENDING, PENDING_PAYMENT)
* Database: retail
* Table: orders
* Get revenue for each order (using order_items). Output should contain order id and revenue, revenue should be rounded off to 2 digits. Also output should be saved to a new table called order_revenue (file format parquet).
* Source Database: retail
* Source Table: order_items
* Target Database: {UNIQUE_IDENTIFIER}_retail
* Target Table: order_revenue
* Reference for [Creating Tables using Spark SQL](https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html)
* Create tables for products, categories and departments, then load the data into all the tables. Use the below data model to come up with column names.
* Base Directory for all data sets **/data/retail_db**
* File Format: TEXTFILE
* Delimiter: Comma ","
* Create a new table for NYSE Data.
* Database Name: **{UNIQUE_IDENTIFIER}_nyse**
* Table Name: **nyse_eod**
* Type: **EXTERNAL**
* File Type: **TEXTFILE** (default)
* Location: **s3://databrickssql/nyse_data/**
* Fields: **stockticker, tradedate, openprice, highprice, lowprice, closeprice, volume**
* Data Types: **Prices are of type float, stockticker is of type string, tradedate is of type int and volume is of type bigint.**
* Field Delimiter: **Comma (,)**
* Run `SELECT * FROM ` the table name to ensure data is loaded properly. Also run the count to make sure that all the data is loaded into the table.
* Compute Daily Product Revenue
* Source Database: **retail**
* Consider only COMPLETE and CLOSED orders
* Output should contain - order_date, product_name, revenue based on order_item_subtotal
* Round off revenue to 2 digits
* Save output to daily_product_revenue (sort the data by order_date in ascending and revenue in descending).
* Target Database: **{UNIQUE_IDENTIFIER}_retail**
* Target File Format: TEXTFILE
* Get the details of orders where there are no records in order_items.
* Source Database: retail
* Tables: orders and order_items
* Data Validation: Validate if order_item_subtotal is equal to product of order_item_product_price and order_item_quantity.
* Source Database: retail
* Source Table: order_items
* Get the order count for each day and status sorted by order_date in ascending order and then by order_status where data is sorted COMPLETE, CLOSED and rest in ascending order.
* Source Database: retail
* Source Table: orders
* Create FACT Table(s) for the following reports:
* Daily Product Revenue
* Daily Category Revenue
* Daily Department Revenue
* Roll up to Weekly, Monthly, Yearly
* Support To Date Reports
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment