Analyzing Data in Azure Data Lake Storage Gen 2 using Databricks

Set up Data Lake Gen 2 in your Azure Subscription

  • Create a Storage Account
  • Select ‘StorageV2 (general purpose v2)’ for Account kind
  • Go to the Advanced tab and, under ‘Data Lake Storage Gen2’, set ‘Hierarchical namespace’ to ‘Enabled’

[Screenshot: Create Storage Account]

[Screenshot: Hierarchical namespace setting]

  • Once the storage account is created, open Azure Storage Explorer; the account will appear under your subscription.
  • Create a file system (container) in the storage account. A CLI sketch for these steps follows below.
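If you prefer to script the storage setup, a minimal Azure CLI sketch along these lines should work. The resource group and location are placeholders of your choosing, and --enable-hierarchical-namespace may require a recent CLI version (older versions needed the storage-preview extension).

# Create a general-purpose v2 storage account with the hierarchical namespace (ADLS Gen2) enabled
az storage account create \
  --name <storage-account-name> \
  --resource-group <resource-group> \
  --location <location> \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true

# Create a container (file system) in the account
az storage container create \
  --name <container-name> \
  --account-name <storage-account-name> \
  --auth-mode login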

Create a Service Principal for Databricks

[Screenshot: Service Principal]

Open the Service Principal and record its client secret.
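If you do not already have a service principal, one can also be created from the command line; this is a sketch, and the display name databricks-adls-sp is an arbitrary placeholder. The appId, password, and tenant fields in the JSON output map to the <appID>, <password>, and <tenant-id> parameters used later in the notebook.

# Create an Azure AD application, service principal, and client secret in one step
# (older CLI versions may need --skip-assignment to suppress a default role assignment)
az ad sp create-for-rbac --name databricks-adls-sp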

Create an RBAC role assignment in the Data Lake Storage Account (IAM)

Select the Contributor role for the Databricks Service Principal.
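The same role assignment can be scripted with the Azure CLI; this sketch assumes the storage account's resource ID as the scope, with the subscription and resource group names as placeholders.

# Assign the Contributor role to the Databricks service principal, scoped to the storage account
az role assignment create \
  --assignee <appID> \
  --role "Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account-name>"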

Create ACLs in the Data Lake Storage Account

After setting the RBAC for the Service Principal, go to Azure Storage Explorer and set the ACLs. This step takes a few sub-steps to complete. To apply the ACLs to the Data Lake Gen 2 file system, you need the Service Principal's object ID. One way to retrieve it is with the Azure CLI: take the application (client) ID of the Databricks Service Principal and run the following command.

az ad sp show --id <Application (client) ID of the Databricks Service Principal in AAD>
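If you only need the object ID, it can be extracted directly with a JMESPath query; note that on newer Azure CLI versions the property is named id rather than objectId.

# Print just the service principal's object ID (use --query id on newer CLI versions)
az ad sp show --id <appID> --query objectId -o tsv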

Retrieve the object ID from the JSON output (or from the query above) and record it. Then open Azure Storage Explorer and navigate to the Storage Account and Container. Right-click the Container and select Manage Access…

[Screenshot: Manage Access]

Next, enter the Object ID of the Service Principal and grant the required access permissions (read, write, and execute as needed). A CLI alternative is sketched below.
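As an alternative to Storage Explorer, recent versions of the Azure CLI can apply the ACL with az storage fs access set. This is a sketch: the command replaces the existing ACL on the path, so the base user/group/other entries are included alongside the service principal entry; adjust them to your needs or stick with Storage Explorer as described above.

# Grant rwx on the root of the file system to the service principal's object ID
# (base entries are included because "access set" replaces the full ACL)
az storage fs access set \
  --acl "user::rwx,group::r-x,other::---,user:<object-id>:rwx" \
  --path "/" \
  --file-system <container-name> \
  --account-name <storage-account-name> \
  --auth-mode login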

Analyzing Data in Azure Data Lake Storage Gen 2 using Databricks

This example is taken from a tutorial in the Azure documentation.

Go to Databricks and open a Notebook

Run the following code, substituting the values recorded in the previous steps. Replace the following parameters:

  • <storage-account-name> - Data Lake Storage Account name
  • <appID> - Databricks Service Principal application (client) ID
  • <password> - Databricks Service Principal secret
  • <tenant-id> - Directory (tenant) ID from Azure Active Directory
  • <container-name> - Data Lake Gen 2 container name
  • <file-system> - Data Lake Gen 2 file system (the same value as the container name)
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<appID>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", "<password>")   
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")

Load sample data into Data Lake Gen 2

Run these in two separate notebook cells: the %sh cell downloads the sample file to the driver's local disk, and the Python cell copies it into the Data Lake Gen 2 file system.

%sh wget -P /tmp https://raw.githubusercontent.com/Azure/usql/master/Examples/Samples/Data/json/radiowebsite/small_radio_json.json

# Copy the downloaded file from the driver's local disk into ADLS Gen 2
dbutils.fs.cp("file:///tmp/small_radio_json.json", "abfss://<file-system>@<storage-account-name>.dfs.core.windows.net/")

Databricks: Run a Spark SQL Job

%sql
DROP TABLE IF EXISTS radio_sample_data;
CREATE TABLE radio_sample_data
USING json
OPTIONS (
 path  "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/small_radio_json.json"
)
%sql
SELECT * from radio_sample_data