Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save datadu-de/c2be6b0458f0a46c9e24d6a9e5a66466 to your computer and use it in GitHub Desktop.
Save datadu-de/c2be6b0458f0a46c9e24d6a9e5a66466 to your computer and use it in GitHub Desktop.
{"cells":[{"cell_type":"markdown","source":["## Store secrets in Azure Key Vault\n<https://learn.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes#--create-an-azure-key-vault-backed-secret-scope>"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"69c7c30b-bb87-480a-809c-bf3864d2a2ef","inputWidgets":{},"title":""}}},{"cell_type":"markdown","source":["## Connect to MDW (Azure SQL)\n<https://learn.microsoft.com/en-us/azure/databricks/external-data/sql-server>"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"c3251b49-1370-4525-b190-552dbe0375f0","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["driver = \"com.microsoft.sqlserver.jdbc.SQLServerDriver\"\n\ndatabase_host = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"database-host\")\ndatabase_port = \"1433\" # update if you use a non-default port\ndatabase_name = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"database-name\")\n\nsqluser = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"sqluser\")\nsqlpassword = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"sqlpassword\")\n\nurl = f\"jdbc:sqlserver://{database_host}:{database_port};databaseName={database_name}\""],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"920c56dc-d6fd-4ce0-bdfd-d85f1c4eeb01","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["### Read Customer dimension\n<https://learn.microsoft.com/en-us/azure/databricks/external-data/sql-server#query-the-remote-table>"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"78fa288b-645a-4a76-b018-3cd06880c8ea","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["customers = (\n spark.read.format(\"jdbc\")\n .option(\"driver\", driver)\n .option(\"url\", url)\n .option(\"dbtable\", \"MDW.Customer\")\n # .option(\"query\", \"select 1 as c1\")\n .option(\"user\", sqluser)\n .option(\"password\", sqlpassword)\n .load()\n)\n\ndisplay(customers)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"ee46b3a4-2331-4575-869b-167a0e8a85de","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["### Read Product dimension"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"dcfcfe3b-db0c-4a91-bd3f-186577df11f6","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["products = (\n spark.read.format(\"jdbc\")\n .option(\"driver\", driver)\n .option(\"url\", url)\n .option(\"dbtable\", \"MDW.Product\")\n .option(\"user\", sqluser)\n .option(\"password\", sqlpassword)\n .load()\n)\n\ndisplay(products)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"77bdded4-279a-4720-a05f-23a5c6c25a2d","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Add ODX ADLS mountpoint\n<https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts>"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"cbf015ab-dd3c-4cab-930c-5ccd30b19133","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["# add ADLS Gen 2 mountpoint\ndef add_mount_point(\n mountpointname, container, storageaccount, scope_name=\"azure-key-vault\"\n):\n client_id = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"application-id\")\n client_secret = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"application-secret\")\n tenant_id = dbutils.secrets.get(scope=\"azure-key-vault\", key=\"tenant-id\")\n\n configs = {\n \"fs.azure.account.auth.type\": \"OAuth\",\n \"fs.azure.account.oauth.provider.type\": \"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\",\n \"fs.azure.account.oauth2.client.id\": client_id,\n \"fs.azure.account.oauth2.client.secret\": client_secret,\n \"fs.azure.account.oauth2.client.endpoint\": f\"https://login.microsoftonline.com/{tenant_id}/oauth2/token\",\n }\n\n dbutils.fs.mount(\n source=\"abfss://\" + container + \"@\" + storageaccount + \".dfs.core.windows.net/\",\n mount_point=mountpointname,\n extra_configs=configs,\n )"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"f05745a2-2753-4d49-aece-2a35653a3fde","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["add_mount_points = [\n {\n \"mountpointname\": \"/mnt/odx-adls\",\n \"container\": dbutils.secrets.get(scope=\"azure-key-vault\", key=\"storage-container\"),\n \"storageaccount\": dbutils.secrets.get(scope=\"azure-key-vault\", key=\"storage-account\"),\n },\n]\n\nexisting_mount_points = [m.mountPoint for m in dbutils.fs.mounts()]\n\nfor mp in add_mount_points:\n if mp.get(\"mountpointname\") not in existing_mount_points:\n add_mount_point(**mp)\n else:\n print(f\"{mp.get('mountpointname')} already exists\")"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"4e68500d-4722-457b-9c1c-dbeb253e1d46","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["%fs\nls /mnt/odx-adls/AW2019/sales_salesorderheader/"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"177200b0-9f17-4968-987e-6ec7907fffa7","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["### find last version of ODX table"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"779a7c6e-c9c2-4871-950a-a035a8e454ec","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["all_versions = dbutils.fs.ls(\"/mnt/odx-adls/AW2019/sales_salesorderheader/\")\n\nlatest_version = max([v.path for v in all_versions if v.name != \"_WORK/\"])\n\nlatest_version_path = latest_version + \"DATA/\"\n\nprint(latest_version_path)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"46b6c944-b6f3-4e4f-b0c4-979928dd0ce7","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["def get_latest_odx_version(table_folder):\n\n all_versions = dbutils.fs.ls(table_folder)\n latest_version = max([v.path for v in all_versions if v.name != \"_WORK/\"])\n latest_version_path = latest_version + \"DATA/\"\n\n return latest_version_path"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"a3a434d8-7be3-442c-8762-f912a6d873bc","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["### Read SalesOrderHeader"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"c866aa48-0464-402b-af9d-d6b8245527cd","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["salesorderheader = spark.read.parquet(\n get_latest_odx_version(\"/mnt/odx-adls/AW2019/sales_salesorderheader/\")\n)\n\ndisplay(salesorderheader)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"06c1cc27-d96f-4b29-b646-67379bc7fcdc","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["### Read SalesOrderDetail"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"d2c0ae52-5dc3-4648-973a-e2856ba68847","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["salesorderdetail = spark.read.parquet(\n get_latest_odx_version(\"/mnt/odx-adls/AW2019/sales_salesorderdetail/\")\n)\n\ndisplay(salesorderdetail)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"aeab263b-1c2b-4084-8bc4-54a6fb132fe9","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Create dataset for analysis"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"e690b9ab-74d4-485a-a298-21761f3bbf51","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["import pyspark.sql.functions as f\n\nmy_analysis = (\n salesorderdetail\n .join(salesorderheader, on=\"SalesOrderID\")\n .join(customers, on=\"CustomerID\")\n .join(products, on=\"ProductID\")\n .select(\n [\n \"OrderQty\",\n \"UnitPrice\",\n \"TerritoryName\",\n \"CountryRegionCode\",\n \"CategoryName\",\n \"SubcategoryName\",\n \"SalesOrderID\",\n \"OrderDate\"\n ]\n )\n .withColumn(\n \"Region-Territory\",\n f.concat_ws(\"-\", f.col(\"CountryRegionCode\"), f.col(\"TerritoryName\")),\n )\n .withColumn(\n \"Category-Subcategory\",\n f.concat_ws(\"-\", f.col(\"CategoryName\"), f.col(\"SubcategoryName\")),\n )\n .withColumn(\"GrossLineTotal\", f.col(\"OrderQty\") * f.col(\"UnitPrice\"))\n)\n\ndisplay(my_analysis)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"be17d55e-b75b-48f2-9aff-346550daee08","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["# my_analysis.createOrReplaceGlobalTempView(\"my_analysis\")\nmy_analysis.write.saveAsTable(\"my_analysis\", mode=\"overwrite\")"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"b2496cfd-1f14-4e99-8c61-f722a54af04f","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Perform advanced analysis"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"a7bbf34e-52a9-493d-a1e0-f4956eb2be07","inputWidgets":{},"title":""}}},{"cell_type":"code","source":["import ipywidgets as w\nimport seaborn as sns\n\n\n@w.interact(\n hue_by=[\n \"TerritoryName\",\n \"CountryRegionCode\",\n \"CategoryName\",\n \"SubcategoryName\",\n \"Region-Territory\",\n \"Category-Subcategory\",\n ],\n fit=True,\n)\ndef fun(hue_by=\"CategoryName\", fit=True):\n sns.lmplot(\n x=\"GrossLineTotal\",\n y=\"OrderCount\",\n hue=hue_by,\n fit_reg=fit,\n data=(\n my_analysis.groupBy(\n [\n \"TerritoryName\",\n \"CountryRegionCode\",\n \"CategoryName\",\n \"SubcategoryName\",\n \"Region-Territory\",\n \"Category-Subcategory\",\n \"OrderDate\"\n ]\n )\n .agg(\n f.sum(\"GrossLineTotal\").alias(\"GrossLineTotal\"),\n f.sum(\"OrderQty\").alias(\"OrderQty\"),\n f.avg(\"UnitPrice\").alias(\"UnitPrice\"),\n f.count(\"SalesOrderID\").alias(\"OrderCount\"),\n )\n .toPandas()\n ),\n )"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"d4f5d029-068b-4b0e-8e1f-6f167949e36b","inputWidgets":{},"title":""}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Resources and links\n* [Create an Azure Key Vault-backed secret scope](https://learn.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes#--create-an-azure-key-vault-backed-secret-scope)\n* [Query SQL Server with Azure Databricks](https://learn.microsoft.com/en-us/azure/databricks/external-data/sql-server)\n* [Mount ADLS Gen2 or Blob Storage with ABFS](https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts)\n* [Read Parquet files with Azure Databricks](https://learn.microsoft.com/en-us/azure/databricks/external-data/parquet)\n* [Apache PySpark DataFrame API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html)\n* [Apache PySpark DataFrame join API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join)"],"metadata":{"application/vnd.databricks.v1+cell":{"showTitle":false,"cellMetadata":{},"nuid":"c0ec5ce8-b4cb-415d-86dd-91d90a2e29be","inputWidgets":{},"title":""}}}],"metadata":{"application/vnd.databricks.v1+notebook":{"notebookName":"01 - Use Data Science Tools with TimeXtender + ODX","dashboards":[],"notebookMetadata":{"pythonIndentUnit":4,"mostRecentlyExecutedCommandWithImplicitDF":{"commandId":2858263858763696,"dataframes":["_sqldf"]}},"language":"python","widgets":{},"notebookOrigID":2858263858763679}},"nbformat":4,"nbformat_minor":0}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment