Last active
February 13, 2023 22:23
-
-
Save datadu-de/c2be6b0458f0a46c9e24d6a9e5a66466 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{"cells":[{"cell_type":"markdown","source":["## Store secrets in Azure Key Vault\n<https://learn.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes#--create-an-azure-key-vault-backed-secret-scope>"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"69c7c30b-bb87-480a-809c-bf3864d2a2ef","inputWidgets":{},"title":""}}},{"cell_type":"markdown","source":["## Connect to MDW (Azure SQL)\n<https://learn.microsoft.com/en-us/azure/databricks/external-data/sql-server>"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"c3251b49-1370-4525-b190-552dbe0375f0","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["driver = \"com.microsoft.sqlserver.jdbc.SQLServerDriver\"\n\ndatabase_host = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"database-host\")\ndatabase_port = \"1433\" # update if you use a non-default port\ndatabase_name = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"database-name\")\n\nsqluser = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"sqluser\")\nsqlpassword = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"sqlpassword\")\n\nurl = f\"jdbc:sqlserver://{database_host}:{database_port};databaseName={database_name}\""],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"920c56dc-d6fd-4ce0-bdfd-d85f1c4eeb01","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["### Read Customer dimension\n<https://learn.microsoft.com/en-us/azure/databricks/external-data/sql-server#query-the-remote-table>"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"78fa288b-645a-4a76-b018-3cd06880c8ea","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["customers = (\n spark.read.format(\"jdbc\")\n .option(\"driver\", driver)\n .option(\"url\", url)\n .option(\"dbtable\", \"MDW.Customer\")\n # .option(\"query\", \"select 1 as c1\")\n .option(\"user\", sqluser)\n .option(\"password\", sqlpassword)\n .load()\n)\n\ndisplay(customers)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"ee46b3a4-2331-4575-869b-167a0e8a85de","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["### Read Product dimension"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"dcfcfe3b-db0c-4a91-bd3f-186577df11f6","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["products = (\n spark.read.format(\"jdbc\")\n .option(\"driver\", driver)\n .option(\"url\", url)\n .option(\"dbtable\", \"MDW.Product\")\n .option(\"user\", sqluser)\n .option(\"password\", sqlpassword)\n .load()\n)\n\ndisplay(products)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"77bdded4-279a-4720-a05f-23a5c6c25a2d","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Add ODX ADLS mountpoint\n<https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts>"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"cbf015ab-dd3c-4cab-930c-5ccd30b19133","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["# add ADLS Gen 2 mountpoint\ndef add_mount_point(\n mountpointname, container, storageaccount, scope_name=\"azure-key-vault\"\n):\n client_id = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"application-id\")\n client_secret = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"application-secret\")\n tenant_id = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"tenant-id\")\n\n configs = {\n \"fs.azure.account.auth.type\": \"OAuth\",\n \"fs.azure.account.oauth.provider.type\": \"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\",\n \"fs.azure.account.oauth2.client.id\": client_id,\n \"fs.azure.account.oauth2.client.secret\": client_secret,\n \"fs.azure.account.oauth2.client.endpoint\": f\"https://login.microsoftonline.com/{tenant_id}/oauth2/token\",\n }\n\n dbutils.fs.mount(\n source=\"abfss://\" + container + \"@\" + storageaccount + \".dfs.core.windows.net/\",\n mount_point=mountpointname,\n extra_configs=configs,\n )"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"f05745a2-2753-4d49-aece-2a35653a3fde","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["add_mount_points = [\n {\n \"mountpointname\": \"/mnt/odx-adls\",\n \"container\": dbutils.secrets.get(scope=\"azure-key-vault\", key=\"storage-container\"),\n \"storageaccount\": dbutils.secrets.get(scope=\"azure-key-vault\", key=\"storage-account\"),\n },\n]\n\nexisting_mount_points = [m.mountPoint for m in dbutils.fs.mounts()]\n\nfor mp in add_mount_points:\n if mp.get(\"mountpointname\") not in existing_mount_points:\n add_mount_point(**mp)\n else:\n print(f\"{mp.get('mountpointname')} already exists\")"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"4e68500d-4722-457b-9c1c-dbeb253e1d46","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["%fs\nls /mnt/odx-adls/AW2019/sales_salesorderheader/"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"177200b0-9f17-4968-987e-6ec7907fffa7","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["### find last version of ODX table"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"779a7c6e-c9c2-4871-950a-a035a8e454ec","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["all_versions = dbutils.fs.ls(\"/mnt/odx-adls/AW2019/sales_salesorderheader/\")\n\nlatest_version = max([v.path for v in all_versions if v.name != \"_WORK/\"])\n\nlatest_version_path = latest_version + \"DATA/\"\n\nprint(latest_version_path)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"46b6c944-b6f3-4e4f-b0c4-979928dd0ce7","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["def get_latest_odx_version(table_folder):\n\n all_versions = dbutils.fs.ls(table_folder)\n latest_version = max([v.path for v in all_versions if v.name != \"_WORK/\"])\n latest_version_path = latest_version + \"DATA/\"\n\n return latest_version_path"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"a3a434d8-7be3-442c-8762-f912a6d873bc","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["### Read SalesOrderHeader"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"c866aa48-0464-402b-af9d-d6b8245527cd","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["salesorderheader = spark.read.parquet(\n get_latest_odx_version(\"/mnt/odx-adls/AW2019/sales_salesorderheader/\")\n)\n\ndisplay(salesorderheader)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"06c1cc27-d96f-4b29-b646-67379bc7fcdc","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["### Read SalesOrderDetail"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"d2c0ae52-5dc3-4648-973a-e2856ba68847","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["salesorderdetail = spark.read.parquet(\n get_latest_odx_version(\"/mnt/odx-adls/AW2019/sales_salesorderdetail/\")\n)\n\ndisplay(salesorderdetail)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"aeab263b-1c2b-4084-8bc4-54a6fb132fe9","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Create dataset for analysis"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"e690b9ab-74d4-485a-a298-21761f3bbf51","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["import pyspark.sql.functions as f\n\nmy_analysis = (\n salesorderdetail\n .join(salesorderheader, on=\"SalesOrderID\")\n .join(customers, on=\"CustomerID\")\n .join(products, on=\"ProductID\")\n .select(\n [\n \"OrderQty\",\n \"UnitPrice\",\n \"TerritoryName\",\n \"CountryRegionCode\",\n \"CategoryName\",\n \"SubcategoryName\",\n \"SalesOrderID\",\n \"OrderDate\"\n ]\n )\n .withColumn(\n \"Region-Territory\",\n f.concat_ws(\"-\", f.col(\"CountryRegionCode\"), f.col(\"TerritoryName\")),\n )\n .withColumn(\n \"Category-Subcategory\",\n f.concat_ws(\"-\", f.col(\"CategoryName\"), f.col(\"SubcategoryName\")),\n )\n .withColumn(\"GrossLineTotal\", f.col(\"OrderQty\") * f.col(\"UnitPrice\"))\n)\n\ndisplay(my_analysis)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"be17d55e-b75b-48f2-9aff-346550daee08","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["# my_analysis.createOrReplaceGlobalTempView(\"my_analysis\")\nmy_analysis.write.saveAsTable(\"my_analysis\", mode=\"overwrite\")"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"b2496cfd-1f14-4e99-8c61-f722a54af04f","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Perform advanced analysis"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"a7bbf34e-52a9-493d-a1e0-f4956eb2be07","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["import ipywidgets as w\nimport seaborn as sns\n\n\n@w.interact(\n hue_by=[\n \"TerritoryName\",\n \"CountryRegionCode\",\n \"CategoryName\",\n \"SubcategoryName\",\n \"Region-Territory\",\n \"Category-Subcategory\",\n ],\n fit=True,\n)\ndef fun(hue_by=\"CategoryName\", fit=True):\n sns.lmplot(\n x=\"GrossLineTotal\",\n y=\"OrderCount\",\n hue=hue_by,\n fit_reg=fit,\n data=(\n my_analysis.groupBy(\n [\n \"TerritoryName\",\n \"CountryRegionCode\",\n \"CategoryName\",\n \"SubcategoryName\",\n \"Region-Territory\",\n \"Category-Subcategory\",\n \"OrderDate\"\n ]\n )\n .agg(\n f.sum(\"GrossLineTotal\").alias(\"GrossLineTotal\"),\n f.sum(\"OrderQty\").alias(\"OrderQty\"),\n f.avg(\"UnitPrice\").alias(\"UnitPrice\"),\n f.count(\"SalesOrderID\").alias(\"OrderCount\"),\n )\n .toPandas()\n ),\n )"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"d4f5d029-068b-4b0e-8e1f-6f167949e36b","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Resources and links\n* [Create an Azure Key Vault-backed secret scope](https://learn.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes#--create-an-azure-key-vault-backed-secret-scope)\n* [Query SQL Server with Azure Databricks](https://learn.microsoft.com/en-us/azure/databricks/external-data/sql-server)\n* [Mount ADLS Gen2 or Blob Storage with ABFS](https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts)\n* [Read Parquet files with Azure Databricks](https://learn.microsoft.com/en-us/azure/databricks/external-data/parquet)\n* [Apache PySpark DataFrame API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html)\n* [Apache PySpark DataFrame join API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"c0ec5ce8-b4cb-415d-86dd-91d90a2e29be","inputWidgets":{},"title":""}}}],"metadata":{"application/vnd.databricks.v1+notebook":{"notebookName":"01 - Use Data Science Tools with TimeXtender + ODX","dashboards":[],"notebookMetadata":{"pythonIndentUnit":4,"mostRecentlyExecutedCommandWithImplicitDF":{"commandId":2858263858763696,"dataframes":["_sqldf"]}},"language":"python","widgets":{},"notebookOrigID":2858263858763679}},"nbformat":4,"nbformat_minor":0} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment