Analyzing Data in Azure Data Lake Storage Gen 2 using Databricks

Set up Data Lake Gen 2 in your Azure Subscription

  • Create a Storage Account
  • Select ‘StorageV2 (general purpose v2)’ for Account kind
  • Go to the Advanced tab and, under ‘Data Lake Storage Gen2’, set ‘Hierarchical namespace’ to ‘Enabled’

[Screenshot: Create Storage Account]

[Screenshot: Hierarchical namespace setting]

  • Once the storage account is created, open Azure Storage Explorer; the account will appear under your subscription.
  • Create a file system (container) in the storage account. A CLI sketch for these steps follows below.
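If you prefer to script the storage setup, a minimal Azure CLI sketch along these lines should work. The resource group and location are placeholders of your choosing, and --enable-hierarchical-namespace may require a recent CLI version (older versions needed the storage-preview extension).

# Create a general-purpose v2 storage account with the hierarchical namespace (ADLS Gen2) enabled
az storage account create \
  --name <storage-account-name> \
  --resource-group <resource-group> \
  --location <location> \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true

# Create a container (file system) in the account
az storage container create \
  --name <container-name> \
  --account-name <storage-account-name> \
  --auth-mode login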

Create a Service Principal for Databricks

[Screenshot: Service Principal]

Open the Service Principal and record its client secret.
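If you do not already have a service principal, one can also be created from the command line; this is a sketch, and the display name databricks-adls-sp is an arbitrary placeholder. The appId, password, and tenant fields in the JSON output map to the <appID>, <password>, and <tenant-id> parameters used later in the notebook.

# Create an Azure AD application, service principal, and client secret in one step
# (older CLI versions may need --skip-assignment to suppress a default role assignment)
az ad sp create-for-rbac --name databricks-adls-sp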

Create an RBAC role assignment in the Data Lake Storage Account (IAM)

Select the Contributor role for the Databricks Service Principal.
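The same role assignment can be scripted with the Azure CLI; this sketch assumes the storage account's resource ID as the scope, with the subscription and resource group names as placeholders.

# Assign the Contributor role to the Databricks service principal, scoped to the storage account
az role assignment create \
  --assignee <appID> \
  --role "Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account-name>"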

Create ACLs in the Data Lake Storage Account

After setting the RBAC for the Service Principal, go to Azure Storage Explorer and set the ACLs. This step takes a few sub-steps to complete. To apply the ACLs to the Data Lake Gen 2 file system, you need the Service Principal's object ID. One way to retrieve it is with the Azure CLI: take the application (client) ID of the Databricks Service Principal and run the following command.

az ad sp show --id <Application (client) ID of the Databricks Service Principal in AAD>
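If you only need the object ID, it can be extracted directly with a JMESPath query; note that on newer Azure CLI versions the property is named id rather than objectId.

# Print just the service principal's object ID (use --query id on newer CLI versions)
az ad sp show --id <appID> --query objectId -o tsv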

Retrieve the object ID from the JSON output (or from the query above) and record it. Then open Azure Storage Explorer and navigate to the Storage Account and Container. Right-click the Container and select Manage Access…

[Screenshot: Manage Access]

Next, enter the Object ID of the Service Principal and grant the required access permissions (read, write, and execute as needed). A CLI alternative is sketched below.
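As an alternative to Storage Explorer, recent versions of the Azure CLI can apply the ACL with az storage fs access set. This is a sketch: the command replaces the existing ACL on the path, so the base user/group/other entries are included alongside the service principal entry; adjust them to your needs or stick with Storage Explorer as described above.

# Grant rwx on the root of the file system to the service principal's object ID
# (base entries are included because "access set" replaces the full ACL)
az storage fs access set \
  --acl "user::rwx,group::r-x,other::---,user:<object-id>:rwx" \
  --path "/" \
  --file-system <container-name> \
  --account-name <storage-account-name> \
  --auth-mode login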

Analyzing Data in Azure Data Lake Storage Gen 2 using Databricks

This example is taken from a tutorial in the Azure documentation.

Go to Databricks and open a Notebook

Run the following code, substituting the values recorded in the previous steps. Replace the following parameters:

  • <storage-account-name> - Data Lake Storage Account name
  • <appID> - Databricks Service Principal application (client) ID
  • <password> - Databricks Service Principal secret
  • <tenant-id> - Directory (tenant) ID from Azure Active Directory
  • <container-name> - Data Lake Gen 2 container name
  • <file-system> - Data Lake Gen 2 file system (the same value as the container name)
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<appID>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", "<password>")   
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")

Load sample data into Data Lake Gen 2

Run these in two separate notebook cells: the %sh cell downloads the sample file to the driver's local disk, and the Python cell copies it into the Data Lake Gen 2 file system.

%sh wget -P /tmp https://raw.githubusercontent.com/Azure/usql/master/Examples/Samples/Data/json/radiowebsite/small_radio_json.json

# Copy the downloaded file from the driver's local disk into ADLS Gen 2
dbutils.fs.cp("file:///tmp/small_radio_json.json", "abfss://<file-system>@<storage-account-name>.dfs.core.windows.net/")

Databricks: Run a Spark SQL Job

%sql
DROP TABLE IF EXISTS radio_sample_data;
CREATE TABLE radio_sample_data
USING json
OPTIONS (
 path  "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/small_radio_json.json"
)
%sql
SELECT * from radio_sample_data