Deploy a 3-node Databricks cluster using Terraform on Azure

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

terraform-azure.png

These artefacts are intended to help you provision a Databricks cluster on Azure using Infrastructure as Code (IaC).


The testing and GitHub documentation were performed by Isaac Arnault, EMEA Managing Director for Data, AI and Analytics at HUBIA (an IT consulting firm for Data, AI, BI and Analytics) in France. This gist is mainly intended for HUBIA's client teams and its prospective customers. Follow Isaac Arnault on GitHub: https://isaacarnault.github.io/.


Without further ado, let's get started. The files below (main.tf, variables.tf, terraform.tfvars) will provision a Databricks workspace, a secret scope, and a cluster with the specified configuration in your Azure subscription. Adjust the configuration to your requirements.


The Azure architecture for the provisioned Databricks cluster would typically include the following components:

• Resource Group: This is the container for all resources associated with the Databricks workspace and cluster.

• Databricks Workspace: This is the Databricks environment where you can create and manage clusters, notebooks, libraries, jobs, and dashboards. It's a managed Spark environment.

• Databricks Cluster: This is the compute layer where your Spark jobs will run. It consists of multiple VMs (in this case, a 3-node cluster) of the specified size and autoscaling configuration.

• Virtual Network (VNet): You might have a VNet where your Databricks cluster is deployed for network isolation and security. It might include subnets for different purposes, such as cluster nodes, management, and gateway.

• Network Security Group (NSG): NSGs can be associated with subnets to control inbound and outbound traffic to the Databricks cluster. You may have rules to allow traffic only from specific sources, ports, or protocols.

• Managed Identity: Optionally, you might assign a managed identity to the Databricks cluster to authenticate with other Azure services securely.

• Storage Account: Databricks clusters often use Azure Storage for storing cluster logs, audit logs, and other metadata.

• Key Vault: You might use Azure Key Vault to securely store and manage secrets, such as Databricks access tokens or database connection strings.

• Azure Active Directory (AAD): Azure AD might be integrated with the Databricks workspace for user authentication and access control.

• Load Balancer (Optional): If your cluster is accessed by external clients or services, you might use Azure Load Balancer to distribute incoming traffic across the cluster nodes.


Below are some technical parameters you can configure in your Azure infrastructure for the components above, each illustrated with a brief Terraform sketch.

• Azure Storage Account: Purpose: Azure Storage can be used as a data lake or as a storage solution for various data processing tasks in Databricks, such as storing input data, intermediate results, or output data. Configuration: In Terraform, you can use the azurerm_storage_account resource to provision a storage account, specifying parameters such as the storage account name, resource group, location, and account kind (e.g., StorageV2 for general-purpose storage). Additional resources: You may also want to provision containers or file shares within the storage account using Terraform's azurerm_storage_container or azurerm_storage_share resources.
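As an illustration, here is a minimal sketch of a storage account with a container for cluster logs. The resource names are hypothetical, the databricks_rg resource group is assumed from main.tf further below, and the account name must be globally unique.

resource "azurerm_storage_account" "databricks_sa" {
  name                     = "databrickssa001" # hypothetical; must be globally unique, lowercase alphanumeric
  resource_group_name      = azurerm_resource_group.databricks_rg.name
  location                 = azurerm_resource_group.databricks_rg.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  account_kind             = "StorageV2"
}

resource "azurerm_storage_container" "cluster_logs" {
  name                  = "cluster-logs"
  storage_account_name  = azurerm_storage_account.databricks_sa.name
  container_access_type = "private"
}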

• Virtual Network (VNet) and Subnet: Purpose: Deploying Databricks within a VNet provides network isolation and enables better control over network traffic and security. Configuration: Use Terraform's azurerm_virtual_network resource to create a VNet and the azurerm_subnet resource to define subnets within it, specifying parameters such as the address space, subnet CIDR blocks, and the association with the Databricks cluster. Additional considerations: Consider configuring network security groups (NSGs) and route tables to control inbound and outbound traffic between subnets and other network resources.
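A minimal sketch with hypothetical names and address ranges; note that full VNet injection for Databricks additionally requires dedicated public and private subnets delegated to Microsoft.Databricks/workspaces, omitted here for brevity.

resource "azurerm_virtual_network" "databricks_vnet" {
  name                = "databricks-vnet"
  resource_group_name = azurerm_resource_group.databricks_rg.name
  location            = azurerm_resource_group.databricks_rg.location
  address_space       = ["10.0.0.0/16"]
}

resource "azurerm_subnet" "databricks_public" {
  name                 = "databricks-public"
  resource_group_name  = azurerm_resource_group.databricks_rg.name
  virtual_network_name = azurerm_virtual_network.databricks_vnet.name
  address_prefixes     = ["10.0.1.0/24"]
}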

• Network Security Group (NSG): Purpose: NSGs let you filter network traffic to and from Azure resources in a VNet, providing an additional layer of security. Configuration: Use Terraform's azurerm_network_security_group resource to create an NSG, defining inbound and outbound security rules that allow or deny traffic based on source and destination IP addresses, ports, and protocols. Association: In current azurerm provider versions, associate the NSG with a subnet using the azurerm_subnet_network_security_group_association resource (older provider versions exposed a network_security_group_id attribute on azurerm_subnet).
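A minimal sketch with a single illustrative inbound rule, associated with the hypothetical databricks_public subnet from the VNet sketch above:

resource "azurerm_network_security_group" "databricks_nsg" {
  name                = "databricks-nsg"
  resource_group_name = azurerm_resource_group.databricks_rg.name
  location            = azurerm_resource_group.databricks_rg.location

  security_rule {
    name                       = "allow-https-inbound"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = "VirtualNetwork"
    destination_address_prefix = "*"
  }
}

resource "azurerm_subnet_network_security_group_association" "databricks_public" {
  subnet_id                 = azurerm_subnet.databricks_public.id
  network_security_group_id = azurerm_network_security_group.databricks_nsg.id
}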

• Azure Key Vault: Purpose: Azure Key Vault provides a secure way to store and manage sensitive information such as passwords, cryptographic keys, and secrets. Configuration: Use Terraform's azurerm_key_vault resource to create a Key Vault. Define access policies to specify who can access and manage secrets stored in the Key Vault. Secret Management: Use Terraform's azurerm_key_vault_secret resource to manage secrets within the Key Vault, such as Databricks access tokens or database connection strings.
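A minimal sketch; the vault name must be globally unique, and var.databricks_token is a hypothetical variable standing in for a secret you supply yourself.

data "azurerm_client_config" "current" {}

resource "azurerm_key_vault" "databricks_kv" {
  name                = "databricks-kv-001" # hypothetical; must be globally unique
  resource_group_name = azurerm_resource_group.databricks_rg.name
  location            = azurerm_resource_group.databricks_rg.location
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard"

  access_policy {
    tenant_id          = data.azurerm_client_config.current.tenant_id
    object_id          = data.azurerm_client_config.current.object_id
    secret_permissions = ["Get", "List", "Set", "Delete"]
  }
}

resource "azurerm_key_vault_secret" "databricks_token" {
  name         = "databricks-access-token"
  value        = var.databricks_token # hypothetical variable; declare it in variables.tf
  key_vault_id = azurerm_key_vault.databricks_kv.id
}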

• Azure Active Directory (AAD) Integration: Purpose: Integrating with Azure AD allows you to enforce authentication and access control policies based on user identities and group memberships. Configuration: Configure Azure AD integration in the Databricks workspace settings. You can enable single sign-on (SSO) and role-based access control (RBAC) to manage user access. Service Principal: You'll likely need to create a service principal in Azure AD and grant appropriate permissions to the Databricks workspace for accessing Azure resources.
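Creating the service principal itself can also be done in Terraform via the hashicorp/azuread provider. A hedged sketch with hypothetical names (attribute names follow azuread v2.x; v3.x renames application_id to client_id):

resource "azuread_application" "databricks" {
  display_name = "databricks-terraform" # hypothetical name
}

resource "azuread_service_principal" "databricks" {
  application_id = azuread_application.databricks.application_id
}

resource "azuread_service_principal_password" "databricks" {
  service_principal_id = azuread_service_principal.databricks.id
}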

• Monitoring and Logging: Purpose: Monitoring and logging solutions help track the performance, health, and usage of the Databricks cluster, enabling proactive troubleshooting and optimization. Configuration: Configure Azure Monitor to collect metrics and logs from the Databricks cluster. You can use Log Analytics to centralize log data and create custom queries and dashboards for monitoring. Integration: Databricks provides integration with Azure Monitor and Log Analytics, allowing you to send cluster metrics, application logs, and audit logs to Azure Monitor for analysis and visualization.
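A minimal sketch routing Databricks diagnostic logs (available on the premium SKU) to a Log Analytics workspace; clusters is one of several Databricks log categories, and the names here are hypothetical.

resource "azurerm_log_analytics_workspace" "databricks_logs" {
  name                = "databricks-logs"
  resource_group_name = azurerm_resource_group.databricks_rg.name
  location            = azurerm_resource_group.databricks_rg.location
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

resource "azurerm_monitor_diagnostic_setting" "databricks" {
  name                       = "databricks-diagnostics"
  target_resource_id         = azurerm_databricks_workspace.example.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.databricks_logs.id

  enabled_log {
    category = "clusters" # other categories include jobs, notebook, secrets
  }
}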


Best architecture practices for Azure Databricks include:

• Use a layered architecture: A layered architecture separates your data and workloads into different layers, such as a landing zone, a data lake, and a data warehouse. This makes it easier to manage your data and workloads, and it also improves performance and security.

• Use Delta Lake: Delta Lake is an open-source storage format that provides ACID transactions and other features that make it ideal for storing data in Azure Databricks. It is also compatible with Spark, so you can use existing Spark code to process and transform your data.

• Use autoscaling: Autoscaling allows Azure Databricks to scale your clusters up or down automatically based on demand, which helps reduce compute costs (see the sketch after this list).

• Use managed services: Azure Databricks provides a variety of managed services, such as managed notebooks and managed streaming. These services can help you reduce the operational overhead of managing your Azure Databricks environment.

• Use security features: Azure Databricks provides a variety of security features, such as role-based access control (RBAC) and encryption. These features can help you protect your data and workloads from unauthorized access.
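As a sketch of the autoscaling bullet above (using the databricks/databricks provider and hypothetical names), autotermination_minutes complements autoscaling by shutting down idle clusters entirely:

resource "databricks_cluster" "autoscaling_example" {
  cluster_name            = "autoscaling-example"
  node_type_id            = "Standard_DS3_v2"
  spark_version           = "7.3.x-scala2.12"
  autotermination_minutes = 30 # terminate the cluster after 30 idle minutes

  autoscale {
    min_workers = 2
    max_workers = 5
  }
}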

Author

  • Isaac Arnault - Suggesting a way to deploy a Databricks cluster on Azure

License

All public gists https://gist.github.com/aiPhD
Copyright 2024, Isaac Arnault
MIT License, http://www.opensource.org/licenses/mit-license.php

The cost considerations below assume a cluster running 5 hours per day over a month.

To estimate the monthly cost based on a Databricks cluster running for 5 hours daily, we'll need to consider the pricing for the Databricks cluster instance type (e.g., Standard_DS3_v2) and the number of nodes, as well as any associated Azure services like storage, Key Vault, and Azure AD integration. Let's break down the estimation:

• Databricks Cluster: Determine the hourly rate for the selected instance type and multiply it by the number of nodes. For example, if an illustrative all-in rate for one Standard_DS3_v2 node (VM plus DBUs) is $0.50/hour, a 3-node cluster costs 3 × $0.50/hour = $1.50/hour. Daily cost: 5 hours/day × $1.50/hour = $7.50/day. Monthly cost: $7.50/day × 30 days = $225/month.

• Azure Storage Account: Estimate the storage costs based on the amount of data stored in Azure Storage. Costs may vary depending on the storage tier (e.g., hot, cool, archive) and redundancy options (e.g., locally redundant storage, geo-redundant storage). Use the Azure Pricing Calculator to estimate the storage costs based on your specific requirements.

• Azure Key Vault: Estimate the cost based on the number of operations (e.g., read, write, delete) and the amount of stored data (secrets, keys, certificates). Use the Azure Pricing Calculator to estimate the cost based on your usage.

• Azure Active Directory (AAD) Integration: Determine if there are any additional costs for premium features or B2C/B2B scenarios. Use the Azure Pricing Calculator to estimate any additional costs based on your requirements.

• Monitoring and Logging: Estimate the cost based on the Azure Monitor and Log Analytics features used, such as data ingestion, retention, and log query volume. Use the Azure Pricing Calculator to estimate the cost based on your expected log data volume and features enabled.

To provision the resources, follow these steps:
isaac-arnault-databricks-terraform.png
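The usual Terraform workflow applies: run terraform init to download the providers, terraform plan to preview the changes, and terraform apply to provision the resources; terraform destroy tears everything down when you are done.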

MIT License
Copyright (c) 2024 Isaac Arnault, PhD
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
provider "azurerm" {
features {}
}
resource "azurerm_resource_group" "databricks_rg" {
name = "databricks-rg"
location = "East US" # Change to your desired Azure region
}
resource "azurerm_databricks_workspace" "example" {
name = "databricks-workspace"
resource_group_name = azurerm_resource_group.databricks_rg.name
location = azurerm_resource_group.databricks_rg.location
sku = "premium"
tags = {
environment = "production"
}
}
resource "azurerm_databricks_workspace_secret_scope" "example" {
workspace_resource_id = azurerm_databricks_workspace.example.id
name = "example"
}
resource "azurerm_databricks_cluster" "example" {
resource_group_name = azurerm_resource_group.databricks_rg.name
location = azurerm_resource_group.databricks_rg.location
workspace_name = azurerm_databricks_workspace.example.name
node_type_id = "Standard_DS3_v2" # Change to your desired VM size
spark_version = "7.3.x-scala2.12"
auto_scaling {
min_workers = 2
max_workers = 5
}
# Additional configuration options as needed
}
# terraform.tfvars
# Replace the placeholders below with your actual Azure subscription
# and service principal details.
subscription_id = "YOUR_SUBSCRIPTION_ID"
client_id       = "YOUR_CLIENT_ID"
client_secret   = "YOUR_CLIENT_SECRET"
tenant_id       = "YOUR_TENANT_ID"
variable "subscription_id" {
description = "The Azure subscription ID."
}
variable "client_id" {
description = "The Azure service principal client ID."
}
variable "client_secret" {
description = "The Azure service principal client secret."
}
variable "tenant_id" {
description = "The Azure tenant ID."
}
variable "location" {
description = "The Azure region in which to deploy resources."
default = "East US"
}