- Create a Storage Account (for a scripted alternative, see the sketch after this list)
- Select ‘StorageV2 (general purpose v2)’ for Account kind; the legacy ‘BlobStorage’ kind does not support a hierarchical namespace
- Go to the Advanced tab and, under ‘Data Lake Storage Gen2’, set ‘Hierarchical namespace’ to ‘Enabled’
- Once the storage account is created, open Azure Storage Explorer; the account will appear under Storage Accounts in your subscription
- Create a file system (container) in the storage account
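If you prefer to script the account creation instead of clicking through the portal, here is a minimal sketch using the azure-mgmt-storage SDK. The subscription ID, resource group, account name, and region are placeholders, and the begin_create call pattern assumes azure-mgmt-storage 16 or later.

# Minimal sketch: create a StorageV2 account with hierarchical namespace enabled
# (Data Lake Storage Gen2). All <...> values are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")
poller = client.storage_accounts.begin_create(
    "<resource-group>",
    "<storage-account-name>",
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",      # general purpose v2; required for hierarchical namespace
        location="<region>",
        is_hns_enabled=True,   # this is what makes the account a Data Lake Gen2 account
    ),
)
print(poller.result().provisioning_state)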
Open the Databricks service principal in Azure Active Directory and record its client secret.
Assign the Contributor role (RBAC) to the Databricks service principal on the storage account; a scripted sketch of this assignment follows below.
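If you would rather script the role assignment, the following sketch uses the azure-mgmt-authorization SDK. The subscription ID, resource group, and service principal object ID are placeholders; the Contributor role definition GUID is the fixed ID Azure uses for that built-in role.

# Hedged sketch: assign the Contributor role to the service principal at storage-account scope.
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

client = AuthorizationManagementClient(DefaultAzureCredential(), "<subscription-id>")
scope = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<storage-account-name>"
)
client.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # each role assignment needs a fresh GUID as its name
    RoleAssignmentCreateParameters(
        # b24988ac-6180-42a0-ab88-20f7382dd24c is the built-in Contributor role
        role_definition_id="/subscriptions/<subscription-id>/providers/"
        "Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c",
        principal_id="<service-principal-object-id>",
    ),
)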
After setting the RBAC for the service principal, go to Azure Storage Explorer and enter the ACLs. This step takes a few sub-steps to complete: to apply ACLs to the Data Lake Gen2 account you need the service principal's object ID, and one way to retrieve it is with the Azure CLI. Get the application (client) ID from the Databricks service principal in Azure Active Directory and run the following command in PowerShell.
az ad sp show --id <Application (client) ID from the Databricks service principal in AAD>
Retrieve the object ID from the JSON output and record it. Then go to Azure Storage Explorer, expand the storage account, right-click the container, and select Manage Access….
Next, enter the object ID of the service principal and grant the desired access permissions (read, write, execute). A scripted alternative follows below.
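As an alternative to clicking through Storage Explorer, the ACL can be applied with the azure-storage-file-datalake SDK. The account key and object ID below are placeholders, and update_access_control_recursive assumes SDK version 12.2 or later; addressing the container root via "/" is also an assumption worth verifying against your SDK version.

# Hedged sketch: grant the service principal rwx on the container root and everything below it.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account-name>.dfs.core.windows.net",
    credential="<storage-account-key>",
)
root = service.get_file_system_client("<container-name>").get_directory_client("/")
# "default:" entries make newly created children inherit the same access
root.update_access_control_recursive(
    acl="user:<object-id>:rwx,default:user:<object-id>:rwx"
)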
This example is adapted from a tutorial in the Azure documentation.
Run the following code in a Databricks notebook, substituting values from the previous steps. Replace these parameters:
- <storage-account-name> - Data Lake Storage Gen2 account name
- <appID> - Databricks service principal application (client) ID
- <password> - Databricks service principal client secret
- <tenant-id> - Directory (tenant) ID from Azure Active Directory
- <container-name> - Data Lake Gen2 container name
- <file-system> - Data Lake Gen2 container name (ABFS refers to a container as a file system)
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<appID>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", "<password>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
%sh
# Download the sample JSON file to the driver's local /tmp
wget -P /tmp https://raw.githubusercontent.com/Azure/usql/master/Examples/Samples/Data/json/radiowebsite/small_radio_json.json
dbutils.fs.cp("file:///tmp/small_radio_json.json", "abfss://<file-system>@<account-name>.dfs.core.windows.net/")
%sql
DROP TABLE IF EXISTS radio_sample_data;
CREATE TABLE radio_sample_data
USING json
OPTIONS (
path "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/small_radio_json.json"
)
%sql
SELECT * FROM radio_sample_data
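The same data can also be read directly with the DataFrame API, which is handy as a quick sanity check; the path placeholders match the SQL example above.

# Read the JSON file straight from the Data Lake and preview a few rows
df = spark.read.json(
    "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/small_radio_json.json"
)
df.show(5)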