@alexott
Last active August 21, 2023 12:33
AAD -> DBX synchronizer using Terraform

This gist contains Terraform code that synchronizes groups & users from AAD into a Databricks workspace, without the need to set up a SCIM connector.

This is an extended version of the initial synchronizer implemented by Serge.

To start, download all .tf files to a directory, and create a terraform.tfvars file with the following content:

groups = {
  "My AAD group" = {
    workspace_access = true
    databricks_sql_access = true
    allow_cluster_create = true
    allow_instance_pool_create = false
    admin = false
  },
  "AAD group 2" = {
    .....
  }
}

Please store the state in a remote location as described in the Terraform documentation.
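For reference, a minimal azurerm backend block could look like the sketch below (the resource group, storage account, and container names are placeholders, not part of this gist):

terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"   # placeholder - use your own resource group
    storage_account_name = "tfstatesa"    # placeholder - use your own storage account
    container_name       = "tfstate"      # placeholder - use your own container
    key                  = "aad-dbx-sync.tfstate"
  }
}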

// read group members of given groups from AzureAD every time Terraform is started
data "azuread_group" "this" {
  for_each     = local.all_groups
  display_name = each.value
}

locals {
  all_groups   = toset(keys(var.groups))
  admin_groups = toset([for k, v in var.groups : k if v.admin])
}

// create or remove groups within Databricks - all governed by the "groups" variable
resource "databricks_group" "this" {
  for_each                   = data.azuread_group.this
  display_name               = each.key
  external_id                = data.azuread_group.this[each.key].object_id
  workspace_access           = var.groups[each.key].workspace_access
  databricks_sql_access      = var.groups[each.key].databricks_sql_access
  allow_cluster_create       = var.groups[each.key].allow_cluster_create
  allow_instance_pool_create = var.groups[each.key].allow_instance_pool_create
  force                      = true
}

locals {
  all_members = toset(flatten([for group in values(data.azuread_group.this) : group.members]))
}
// Extract information about real users
data "azuread_users" "users" {
  ignore_missing = true
  object_ids     = local.all_members
}

locals {
  all_users = {
    for user in data.azuread_users.users.users : user.object_id => user
  }
}

// all governed by AzureAD, create or remove users from the Databricks workspace
resource "databricks_user" "this" {
  for_each     = local.all_users
  user_name    = lower(local.all_users[each.key]["user_principal_name"])
  display_name = local.all_users[each.key]["display_name"]
  active       = local.all_users[each.key]["account_enabled"]
  external_id  = each.key
  force        = true
}
// Provision Service Principals
data "azuread_service_principals" "spns" {
  object_ids = toset(setsubtract(local.all_members, data.azuread_users.users.object_ids))
}

locals {
  all_spns = {
    for sp in data.azuread_service_principals.spns.service_principals : sp.object_id => sp
  }
}

resource "databricks_service_principal" "sp" {
  for_each       = local.all_spns
  application_id = local.all_spns[each.key]["application_id"]
  display_name   = local.all_spns[each.key]["display_name"]
  active         = local.all_spns[each.key]["account_enabled"]
  external_id    = each.key
  force          = true
}

locals {
  merged_data = merge(databricks_user.this, databricks_service_principal.sp)
}
// put users to respective groups
// (each group/member pair is encoded as a JSON string so the for_each set contains
// plain strings; the group & member IDs are decoded back below)
resource "databricks_group_member" "this" {
  for_each = toset(flatten([
    for group, details in data.azuread_group.this : [
      for member in details["members"] : jsonencode({
        group  = databricks_group.this[group].id,
        member = local.merged_data[member].id
      })
    ]
  ]))
  group_id  = jsondecode(each.value).group
  member_id = jsondecode(each.value).member
}
// Provisioning Admins
data "azuread_group" "admins" {
  for_each     = local.admin_groups
  display_name = each.value
}

data "databricks_group" "admins" {
  display_name = "admins"
}

resource "databricks_group_member" "admins" {
  for_each = toset(flatten([
    for group, details in data.azuread_group.admins : [
      for member in details["members"] : local.merged_data[member].id
    ]
  ]))
  group_id  = data.databricks_group.admins.id
  member_id = each.value
}

terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "1.1.0"
    }
    azuread = {
      source  = "hashicorp/azuread"
      version = "2.22.0"
    }
  }
}

provider "azuread" {
  # Configuration options
}

provider "databricks" {
}
# we could use optional + defaults to make this easier to use, but it's an experimental feature
variable "groups" {
  description = "Map of AAD group names into object describing workspace & Databricks SQL access permissions"
  type = map(object({
    workspace_access           = bool
    databricks_sql_access      = bool
    allow_cluster_create       = bool
    allow_instance_pool_create = bool
    admin                      = bool # whether this group is for Databricks admins
  }))
}

# Create a variable in the terraform.tfvars with the following content:
# groups = {
#   "AAD Group Name" = {
#     workspace_access           = true
#     databricks_sql_access      = false
#     allow_cluster_create       = false
#     allow_instance_pool_create = false
#     admin                      = false
#   }
# }
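As a note on the optional + defaults comment above: with Terraform 1.3+, where optional object attributes left the experimental stage, the variable could be declared roughly as in the sketch below, so callers only override the attributes they care about (the default values shown are just an illustration, not part of the original gist):

variable "groups" {
  description = "Map of AAD group names into object describing workspace & Databricks SQL access permissions"
  type = map(object({
    workspace_access           = optional(bool, true)  # illustrative defaults
    databricks_sql_access      = optional(bool, true)
    allow_cluster_create       = optional(bool, false)
    allow_instance_pool_create = optional(bool, false)
    admin                      = optional(bool, false)
  }))
}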
@sivadotblog

@alexott this is super helpful. I get the idea, but I get these 2 errors. I am trying to solve these errors as well.

Error: Invalid resource ID
with data.databricks_group.admins
on aad-sync.tf line 90, in data "databricks_group" "admins":
data "databricks_group" "admins" {
Error: Invalid for_each argument
on aad-sync.tf line 72, in resource "databricks_group_member" "this":
  The "for_each" value depends on resource attributes that cannot be determined until apply, so Terraform cannot predict how many instances will be created. To work around this, use the -target argument to first apply only the resources that the for_each depends on.

@alexott
Author

alexott commented Jul 27, 2022

Hmmm, I need to retest with the latest version. What version of databricks and terraform have you used?

@sivadotblog

Hmmm, I need to retest with the latest version. What version of databricks and terraform have you used?

databricks version = "1.0.1"
Terraform v1.1.7
I overcame the for_each issue by targeting a resource like this, but I am yet to test its implications.

terraform plan -target="databricks_user.this"

@alexott
Author

alexott commented Jul 28, 2022

@sivadotblog try the new version - for some reason, the behavior of the merge that was used inside the for_each has changed... I made a modification to handle it, and now everything works fine.

@sivadotblog

Hey @alexott, I don't believe this approach works in the real world. Databricks group member resources need a predefined list and cannot work with dynamic user and group IDs, a.k.a. "known after apply". You will see an error if you try adding a new AD group, or if you convert this to a module and reuse it for a different workspace. I figured out a way to solve it; I'll share it when I log in.

@alexott
Author

alexott commented Aug 18, 2022

This version works for me :-)

@sivadotblog

Yes, I think this works for a single workspace. My case was somewhat more complex; I had to split this into 2 modules.
Module 1 - will create the users and the groups
Module 2 - will associate the groups and users - I had to do some lookups to identify the groups and then associate them, or else I couldn't get past this error: The "for_each" value depends on resource attributes that cannot be determined until apply.

thanks

@alexott
Author

alexott commented Aug 18, 2022

If you need a 2nd module for the association, then you may need to use it in combination with data sources...

@snowch

snowch commented Sep 16, 2022

Please store the state in a remote location as described in the Terraform documentation

I know this is a terraform best practice, but is there a more specific reason you are asking users to do this when working with your gist?

@alexott
Author

alexott commented Sep 16, 2022

@snowch it's just a best practice, but when you have a lot of users, you won't want to do 2x the API calls (1 to submit a user/group and get an error that the user already exists + 1 to import the existing user/group into the state)

@snowch

snowch commented Sep 16, 2022

I'm hitting an issue:

│ Error: Invalid for_each argument
│ 
│   on aad-sync.tf line 76, in resource "databricks_group_member" "this":
│   76:   for_each = toset(flatten([
│   77:     for group, details in data.azuread_group.this : [
│   78:       for member in details["members"] : jsonencode({
│   79:         group = databricks_group.this[group].id,
│   80:         member = local.merged_data[member].id
│   81:       })
│   82:     ]
│   83:   ]))
│     ├────────────────
│     │ data.azuread_group.this is object with 1 attribute "testgroup"
│     │ databricks_group.this is object with 1 attribute "testgroup"
│     │ local.merged_data is object with 1 attribute "xxxxx"
│ 
│ The "for_each" set includes values derived from resource attributes that cannot be determined until
│ apply, and so Terraform cannot determine the full set of keys that will identify the instances of
│ this resource.
│ 
│ When working with unknown values in for_each, it's better to use a map value where the keys are
│ defined statically in your configuration and where only the values contain apply-time results.
│ 
│ Alternatively, you could use the -target planning option to first apply only the resources that the
│ for_each value depends on, and then apply a second time to fully converge.
╵
╷
│ Error: Get "https://adb-xxxxxx.azuredatabricks.net/api/2.0/preview/scim/v2/Groups?filter=displayName%20eq%20%27admins%27": dial tcp: lookup adb-xxxxxx.azuredatabricks.net on xxxxxx:53: no such host
│ 
│   with data.databricks_group.admins,
│   on aad-sync.tf line 94, in data "databricks_group" "admins":
│   94: data "databricks_group" "admins" {
│ 
╵

@alexott
Author

alexott commented Sep 16, 2022

Are you creating the workspace in the same module? It makes sense to make that a separate job.

@snowch

snowch commented Sep 16, 2022

Apologies for the naive question ... which should be run first, workspace or scim setup?

@alexott
Author

alexott commented Sep 16, 2022

Workspace first...
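For context, the databricks provider block in this gist is empty, so it has to be pointed (via arguments or environment variables) at a workspace that already exists - the "no such host" error above is what you get when the workspace URL does not resolve. A minimal sketch with an explicit host (the URL below is a placeholder):

provider "databricks" {
  # the workspace must already be created before applying this configuration;
  # the host below is a placeholder URL
  host = "https://adb-1111111111111111.11.azuredatabricks.net"
}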

@snowch

snowch commented Sep 16, 2022

Still the same problem with workspace created first.

Here is the project - https://github.com/databricks/terraform_databricks_and_uc

@alexott
Author

alexott commented Sep 16, 2022

Interesting - was working fine 2 months ago...

@snowch

snowch commented Sep 19, 2022

I've refactored to do the user and group setup in two separate projects orchestrated by terragrunt.

Everything seems to be working ok now - Thanks!!

@heavy-metal-blues-code

Hi @alexott, this is a very useful gist. Thank you for sharing it.
Do you have something similar to assign users permissions to a Databricks Unity Catalog?
If not, can you recommend a good approach for implementing it?

@Vykistwo

Vykistwo commented Mar 9, 2023

Hi @alexott,
I hope you can help.
I have copied your code as set out above, but unfortunately I am getting an error when I run it in my pipeline.

│ Error: Invalid for_each argument
│
│   on groups.tf line 81, in resource "databricks_group_member" "this":
│   81:   for_each = toset(flatten([
│   82:     for group, details in data.azuread_group.this : [
│   83:       for member in details["members"] : jsonencode({
│   84:         group = databricks_group.this[group].id,
│   85:         member = local.merged_data[member].id
│   86:       })
│   87:     ]
│   88:   ]))
│     ├────────────────
│     │ data.azuread_group.this is object with 2 attributes
│     │ databricks_group.this is object with 2 attributes
│     │ local.merged_data is object with 4 attributes
│
│ The "for_each" set includes values derived from resource attributes that
│ cannot be determined until apply, and so Terraform cannot determine the
│ full set of keys that will identify the instances of this resource.
│
│ When working with unknown values in for_each, it's better to use a map
│ value where the keys are defined statically in your configuration and where
│ only the values contain apply-time results.
│
│ Alternatively, you could use the -target planning option to first apply
│ only the resources that the for_each value depends on, and then apply a
│ second time to fully converge.
╵
make: *** [Makefile:102: plan] Error 1

I am using all the latest versions of providers and my vars are:
groups = { "Readers" = { workspace_access = true databricks_sql_access = true allow_cluster_create = true allow_instance_pool_create = false admin = false }, "DBRAdmins" = { workspace_access = true databricks_sql_access = true allow_cluster_create = true allow_instance_pool_create = false admin = false } }

@sivadotblog can you please share the solution you developed for the issue I have encountered above?

Any help would be greatly appreciated.

@sivadotblog

I had to split this into 2 different Terraform state files.

tf 1 will create the users and AD groups
tf 2 will assign the user membership to the corresponding group.

That way, when part 2 runs, it will already know all the values. I did have to re-write this code quite a bit, but the idea was the same.

@alexott
Author

alexott commented Mar 10, 2023

Thank you @sivadotblog for answering. Yes, I need to update the recipe for it - there was a change in the Terraform behavior at some point…

@sivadotblog

We used this as an interim solution. AAD groups and users cannot really be considered infrastructure. Before UC was GA, we had to create a SCIM connector for every workspace, and this was a better solution than SCIM. Tbh, Unity Catalog and account-level SCIM is the way to go.

@Vykistwo

I had to split this into 2 different Terraform state files.

tf 1 will create the users and AD groups
tf 2 will assign the user membership to the corresponding group.

That way, when part 2 runs, it will already know all the values. I did have to re-write this code quite a bit, but the idea was the same.

do you mind sharing the code?

@heavy-metal-blues-code

Basically, the first will create the AAD users, groups and service principals at the Databricks workspace level.
Later on, you will need to fetch those via Terraform data sources and perform the ADB user and service principal assignments to their respective ADB groups.
Basically, where you see resource in the Terraform code, you change it to data accordingly, as per the HCP documentation.
The last trick is to use the following:

resource "databricks_group_member" "this" {
  for_each = toset(flatten([
    for group, details in data.azuread_group.this : [
      for member in details["members"] : jsonencode({
        group = data.databricks_group.this[group].id,
        member = local.merged_data[member].id
      })
    ]
  ]))
  group_id = jsondecode(each.value).group
  member_id = jsondecode(each.value).member

What I see as wrong in this approach is the fact that you are creating workspace-local groups and not account groups.
If you create account-level groups instead, you can use them across Unity Catalog, workspaces and external locations as a single source of truth.
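A minimal sketch of that account-level direction, assuming the Databricks provider is configured against the account console (the account_id and group name below are placeholders):

provider "databricks" {
  alias      = "account"
  host       = "https://accounts.azuredatabricks.net"
  account_id = "00000000-0000-0000-0000-000000000000" # placeholder
}

// account-level group that can then be assigned to workspaces and used in Unity Catalog grants
resource "databricks_group" "account" {
  provider     = databricks.account
  display_name = "My AAD group"
}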

@sivadotblog

sivadotblog commented Mar 16, 2023

PART 1

locals {
  all_groups = toset(keys(var.groups))
}

#Read AAD group & its members
data "azuread_group" "this" {
  for_each     = local.all_groups
  display_name = each.value
}

locals {
  all_group_members = toset(flatten([for group in values(data.azuread_group.this) : group.members]))
}

#Create those AAD groups in Azure Databricks with the respective entitlements

resource "databricks_group" "this" {
  for_each                   = data.azuread_group.this
  display_name               = each.key
  external_id                = data.azuread_group.this[each.key].object_id
  workspace_access           = var.groups[each.key].workspace_access
  databricks_sql_access      = var.groups[each.key].databricks_sql_access
  allow_cluster_create       = var.groups[each.key].allow_cluster_create
  allow_instance_pool_create = var.groups[each.key].allow_instance_pool_create
  force                      = true
}

locals {
  group_members = [
    for group, details in data.azuread_group.this : details.members
  ]
}

# Read AAD Member info from AAD Group Members
data "azuread_users" "users" {
  ignore_missing = true
  object_ids     = flatten(local.group_members)
}

locals {
  all_aad_users = {
    for user in distinct(data.azuread_users.users.users) : user.object_id => user
  }
}

#remove duplicates in all_aad_users




// all governed by AzureAD, create or remove users from databricks workspace

resource "databricks_user" "this" {
  for_each     = local.all_aad_users
  user_name    = lower(local.all_aad_users[each.key]["user_principal_name"])
  display_name = local.all_aad_users[each.key]["display_name"]
  active       = local.all_aad_users[each.key]["account_enabled"]
  external_id  = each.key
  force        = true
}
#SPN
locals {
  all_spns = toset(keys(var.spns))
}


data "databricks_group" "admins" {
  display_name = "admins"
}

resource "databricks_service_principal" "spn" {
  for_each       = local.all_spns
  application_id = each.key
}

resource "databricks_group_member" "admin_spn" {
  for_each  = databricks_service_principal.spn
  group_id  = data.databricks_group.admins.id
  member_id = each.value.id
  depends_on = [
    databricks_service_principal.spn
  ]
}

PART 2

locals {
  all_groups   = toset(keys(var.groups))
  admin_groups = toset([for k, v in var.groups : k if v.admin])
}
data "azuread_group" "this" {
  for_each     = local.all_groups
  display_name = each.value
}

locals {
  group_members = [
    for group, details in data.azuread_group.this : details.members
  ]
}
data "azuread_users" "users" {
  ignore_missing = true
  object_ids     = flatten(local.group_members)
}

data "databricks_group" "users" {
  display_name = "users"
}

data "databricks_user" "all_users" {
  for_each = data.databricks_group.users.users
  user_id  = each.value
}

locals {
  all_adb_user = {
    for user in data.databricks_user.all_users : user.user_name => user.id
  }
}
data "databricks_group" "all_groups" {
  for_each     = local.all_groups
  display_name = each.value
}

locals {
  adb_grp_id = {
    for group in data.databricks_group.all_groups : group.display_name => group.id
  }
}

locals {
  aad_user_emails = {
    for user in data.azuread_users.users.users : user.object_id => lower(user.user_principal_name)
  }
}


locals {
  az_grp_mem_map = toset(flatten([
    for group, details in data.azuread_group.this : [
      for member in details["members"] : {
        group  = lookup(local.adb_grp_id, group)
        member = lookup(local.all_adb_user, lookup(local.aad_user_emails, member, "not_found"), "not_found")
      }
    ]
  ]))

}

resource "databricks_group_member" "this" {

  for_each = { for data, member in local.az_grp_mem_map : data.member => data if member.member != "not_found" }

  group_id  = each.value.group
  member_id = each.value.member
}

data "azuread_group" "admins" {
  for_each     = local.admin_groups
  display_name = each.value
}

data "databricks_group" "admins" {
  display_name = "admins"
}

locals {
  az_admin_mem = toset(flatten([
    for group, details in data.azuread_group.admins : [
      for member in details["members"] : {
        group  = data.databricks_group.admins.id
        member = lookup(local.all_adb_user, lookup(local.aad_user_emails, member, "not_found"), "not_found")
      }
    ]
  ]))

}

resource "databricks_group_member" "admins" {

  for_each = { for data, member in local.az_admin_mem : data.member => data if member.member != "not_found" }

  group_id  = each.value.group
  member_id = each.value.member
}
