Databricks read from CSV

%py
path = ""  # path to the CSV (left blank in the original note)
df = spark.read.csv(path, header=True)
df.cache()                                # keep the data in memory for reuse
df.createOrReplaceTempView("csv_data")    # expose the DataFrame to SQL
display(df)

Postgres enums

ERROR when adding a NOT NULL enum column to a table that already has rows: column "my_new_column" contains null values

CREATE TYPE my_enum AS ENUM ('value1', 'value2');
ALTER TABLE some_table ADD COLUMN my_new_column my_enum NOT NULL;
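
The existing rows have no value for the new column, so the NOT NULL constraint fails immediately. A workaround sketch (the default 'value1' is just an example):

ALTER TABLE some_table ADD COLUMN my_new_column my_enum NOT NULL DEFAULT 'value1';

-- or add it nullable, backfill, then tighten the constraint:
ALTER TABLE some_table ADD COLUMN my_new_column my_enum;
UPDATE some_table SET my_new_column = 'value1';
ALTER TABLE some_table ALTER COLUMN my_new_column SET NOT NULL;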

Databricks Errors

ExecutionException: org.apache.spark.SparkException: Exception thrown in awaitResult:
  • Reading the Delta log returned a 403 Forbidden - check permissions.
  • The same read worked from a different cluster.
  • The data was simply not permitted to be accessed from the failing cluster.

Running Airflow locally

SQLite: no such table: job

sqlite3.OperationalError: no such table: job

[SQL: INSERT INTO job (dag_id, state, job_type, start_date, end_date, latest_heartbeat, executor_class, hostname, unixname) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)]
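
This usually means the Airflow metadata database was never initialized. Assuming Airflow 2.x, a likely fix is:

airflow db init   # creates the metadata tables (job, dag, task_instance, ...)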

Spark DDL

comments

alter table some_database.some_table alter column reason comment 'some comment';

Unclear, but it seems the comment can't be added when the source data is already Delta and the table already exists - Delta fails with a "specified schema does not match" error.

Python context managers

Context managers are what the 'with' construct is built on:

with blah:            # blah must implement __enter__ and __exit__
    do_something()
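
A minimal sketch of writing one with contextlib (names are hypothetical):

from contextlib import contextmanager

@contextmanager
def managed_resource():
    resource = {"open": True}     # setup runs on entering the with block
    try:
        yield resource            # the value bound by "with ... as r"
    finally:
        resource["open"] = False  # teardown runs even if the block raises

with managed_resource() as r:
    print(r["open"])  # True inside the block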

Spark functions

dir(df)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 ...]
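
The dunder attributes drown out the useful methods; a quick filter (a sketch):

methods = [m for m in dir(df) if not m.startswith("_")]
print(methods[:5])   # e.g. ['agg', 'alias', 'approxQuantile', 'cache', 'checkpoint']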

Spark on Databricks

A Databricks notebook comes with a Spark session already created:

print(dir())             # sc, spark, sql, sqlContext
print(type(spark))       # <class 'pyspark.sql.session.SparkSession'>
print(type(sc))          # <class 'dbruntime.spark_connection.RemoteContext'>
print(type(sql))         # <class 'method'>  Help on method sql in module pyspark.sql.context
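
So SQL can run straight away against the pre-created session, e.g. against the csv_data view registered earlier (a sketch):

result = spark.sql("SELECT COUNT(*) AS n FROM csv_data")
result.show()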

AWS Vault errors

$ aws-vault exec some_profile -- ./some_bash_script.sh
aws-vault: error: exec: exec format error

The script was missing a shebang; adding #!/bin/bash as the first line fixed it.
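
A sketch of the fixed script (contents hypothetical):

#!/bin/bash
# without the shebang above, exec can't determine the interpreter
# and fails with "exec format error"
echo "running under aws-vault credentials"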