Azure Databricks notes

Getting Started with Azure Databricks

Technical Accomplishments:

  • Set the stage for learning on the Databricks platform
  • Create the cluster
  • Discover the workspace, import table data
  • Demonstrate how to develop & execute code within a notebook
  • Review the various "Magic Commands"
  • Introduce the Databricks File System (DBFS)

Attach notebook to your cluster

Before executing any cells in the notebook, you need to attach it to your cluster. Make sure that the cluster is running.

In the notebook's toolbar, select the drop down arrow next to Detached, and then select your cluster under Attach to.

Working with notebooks

A notebook is a web-based interface to a document that contains

  • runnable code
  • visualizations
  • descriptive text

To create a notebook, click on Workspace, browse into the desired folder, right-click and choose Create, then select Notebook.

A notebook contains multiple cells. Each cell has a specific type.

A default programming language is configured when creating the notebook and it will be implicitly used for new cells.

Magic commands

We can override the cell's default programming language by using one of the following magic commands at the start of the cell:

  • %python for cells running python code
  • %scala for cells running scala code
  • %r for cells running R code
  • %sql for cells running sql code

Additional magic commands are available:

  • %md for descriptive cells using markdown
  • %sh for cells running shell commands
  • %run for cells running code defined in a separate notebook
  • %fs for cells running code that uses dbutils commands
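
For example, in a notebook whose default language is Python, a cell can be switched to SQL or Markdown simply by starting it with the corresponding magic command. A minimal sketch (each snippet would be its own cell; nyc_taxi is the table used later in these notes):

%sql
-- run a SQL query from a Python notebook
select * from nyc_taxi limit 10

%md
This **descriptive** cell is rendered as markdown.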

To run a cell, use one of the following options:

  • CTRL+ENTER or CMD+RETURN to run the current cell
  • SHIFT+ENTER or SHIFT+RETURN to run the cell and move to the next one
  • Using Run Cell, Run All Above or Run All Below from the cell's run menu

%python

# convert celsius to fahrenheit
def celsiusToFahrenheit(source_temp=None):
    return (source_temp * (9.0 / 5.0)) + 32.0

#input values - celsius
a = [1, 2, 3, 4, 5]
print(a)

#convert all
b = map(lambda x: celsiusToFahrenheit(x), a)
print(list(b))

Databricks File System - DBFS

We've already imported data into Databricks by uploading our files.

Databricks is capable of mounting external/remote data sources as well.

DBFS allows you to mount storage objects so that you can seamlessly access data without requiring credentials. It lets you interact with object storage using directory and file semantics instead of storage URLs, and it persists files to object storage, so you won't lose data after you terminate a cluster.

  • DBFS is a layer over a cloud-based object store
  • Files in DBFS are persisted to the object store
  • The lifetime of files in DBFS is NOT tied to the lifetime of our cluster

See also Databricks File System - DBFS.

Databricks Utilities - dbutils

  • You can access the DBFS through the Databricks Utilities class (and other file IO routines).
  • An instance of DBUtils is already declared for us as dbutils.

The mount command allows you to use remote storage as if it were a local folder available in the Databricks workspace:

dbutils.fs.mount(
  source = f"wasbs://dev@{data_storage_account_name}.blob.core.windows.net",
  mount_point = data_mount_point,
  extra_configs = {f"fs.azure.account.key.{data_storage_account_name}.blob.core.windows.net": data_storage_account_key})
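
The same dbutils.fs utility can also list and remove mounts from Python code (data_mount_point is the variable used above); a minimal sketch:

# list current mount points and their sources
for m in dbutils.fs.mounts():
    print(m.mountPoint, '->', m.source)

# detach the storage again when it is no longer needed
# dbutils.fs.unmount(data_mount_point)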

To show available DBFS mounts:

%fs 
mounts

To show available tables:

%fs
ls /FileStore/tables

Additional help is available via dbutils.help() and for each sub-utility: dbutils.fs.help(), dbutils.meta.help(), dbutils.notebook.help(), dbutils.widgets.help().

See also Databricks Utilities - dbutils

dbutils.fs.help()

Working with data in Azure Databricks

Technical Accomplishments:

  • Viewing available tables
  • Loading table data in dataframes
  • Loading file/DBFS data in dataframes
  • Using Spark for simple queries
  • Using Spark to show the data and its structure
  • Using Spark for complex queries
  • Using Databricks' display for custom visualizations

Attach notebook to your cluster

Before executing any cells in the notebook, you need to attach it to your cluster. Make sure that the cluster is running.

In the notebook's toolbar, select the drop down arrow next to Detached, and then select your cluster under Attach to.

About Spark DataFrames

Spark DataFrames are distributed collections of data, organized into rows and columns, similar to traditional SQL tables.

A DataFrame can be operated on using relational transformations, through the Spark SQL API, which is available in Scala, Java, Python, and R.

We will use Python in our notebook.

We often refer to DataFrame variables using df.
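
As a minimal sketch (independent of the taxi data used below, with illustrative column names), a DataFrame can also be built directly from local Python objects and queried with relational operations; spark is the SparkSession that Databricks pre-defines in every notebook:

# build a small DataFrame from a list of tuples
df_demo = spark.createDataFrame(
    [(1, 2.5), (2, 7.1), (3, 11.3)],
    ["tripId", "tripDistance"])

# relational transformations: filter rows, project columns
df_demo.filter(df_demo.tripDistance > 5.0).select("tripId").show()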

Loading data into dataframes

View available data

%sql
select * from nyc_taxi;

Reading data from our tables

Using Spark, we can read data into dataframes.

It is important to note that Spark has read/write support for a wide set of formats, including:

  • csv
  • json
  • parquet
  • orc
  • avro
  • hive tables
  • jdbc

We can read our data from the tables (since we already imported the initial csv as Databricks tables).

df = spark.sql("SELECT * FROM nyc_taxi")
display(df)

Reading data from the DBFS

We can also read the data from the original files we've uploaded; or indeed from any other file available in the DBFS.

The code is the same regardless of whether a file is local or in remote storage that was mounted, thanks to DBFS mount points.

df = spark.read.csv('dbfs:/FileStore/tables/nyc_taxi.csv', header=True, inferSchema=True)
display(df)
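
The other formats listed above follow the same reader API. A short sketch; the paths below are illustrative only, not files that exist in this workspace:

# format-specific readers
df_parquet = spark.read.parquet("dbfs:/FileStore/tables/some_data.parquet")
df_json = spark.read.json("dbfs:/FileStore/tables/some_data.json")

# or the generic form with an explicit format
df_orc = spark.read.format("orc").load("dbfs:/FileStore/tables/some_data.orc")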

DataFrame size

Use count to determine how many rows of data we have in a dataframe.

df.count()

DataFrame structure

To get information about the schema associated with our dataframe we can use printSchema:

df.printSchema()

show(..) vs display(..)

  • show(..) is part of core Spark - display(..) is specific to our notebooks.
  • show(..) has parameters for truncating both columns and rows - display(..) does not.
  • show(..) is a function of the DataFrame/Dataset class - display(..) works with a number of different objects.
  • display(..) is more powerful - with it, you can:
    • download the results as CSV
    • render line charts, bar charts & other graphs, maps and more
    • see up to 1000 records at a time
  • For the most part, the difference between the two comes down to preference.

Remember, the display function is Databricks specific. It is not available in standard Spark code.
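
For comparison, a minimal sketch of the core Spark call next to the Databricks-specific one (the row count and truncation settings are arbitrary):

# core Spark: prints the first rows as plain text
df.show(10, truncate=False)

# Databricks only: interactive table with charting and CSV export
display(df)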

Querying dataframes

Once Spark has the data, we can manipulate it using the Spark SQL API.

We can easily use the Spark SQL DSL to do joins, aggregations, and filtering. We can change the data structure, add or drop columns, or change column types.
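
As a quick sketch using column names from the nyc_taxi data above: a filter, an aggregation and a sort chained through the DataFrame API:

from pyspark.sql.functions import col, avg

# average total amount per passenger count, for trips longer than one mile
display(df.filter(col("tripDistance") > 1.0)
          .groupBy("passengerCount")
          .agg(avg("totalAmount").alias("avgTotalAmount"))
          .orderBy("passengerCount"))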

As a starting point for custom logic, here is the Celsius-to-Fahrenheit conversion again as a plain Python function:

def celsiusToFahrenheit(source_temp=None):
    return (source_temp * (9.0 / 5.0)) + 32.0

celsiusToFahrenheit(27)

We will adapt it as a udf (user defined function) to make it usable with Spark's dataframes API.

And we will use it to enrich our source data.

from pyspark.sql.functions import isnan, when, count, col, udf
from pyspark.sql.types import *

udfCelsiusToFahrenheit = udf(lambda z: celsiusToFahrenheit(z), DoubleType())

display(df.filter(col('temperature').isNotNull()) \
  .withColumn("tempC", col("temperature").cast(DoubleType())) \
  .select(col("tempC"), udfCelsiusToFahrenheit(col("tempC")).alias("tempF")))

More complex SQL functions are available in Spark:

  • grouping, sorting, limits, count
  • aggregations: agg, max, sum
  • windowing: partitionBy, count over, max over

For example, we may want to add a row-number column to our source data. Window functions will help with such queries:

from pyspark.sql.window import Window
from pyspark.sql.functions import desc, row_number, monotonically_increasing_id

display(df.orderBy('tripDistance', ascending=False) \
  .withColumn('rowno', row_number().over(Window.orderBy(monotonically_increasing_id()))))
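
A partitioned window works the same way; a hedged sketch that attaches, to every row, the maximum trip distance observed for trips with the same passenger count:

from pyspark.sql.window import Window
from pyspark.sql.functions import max as sql_max, col

w = Window.partitionBy("passengerCount")
display(df.withColumn("maxDistanceForCount", sql_max(col("tripDistance")).over(w)))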

Data cleaning

Before using the source data, we have to validate the contents. Let's see if there are any duplicates:

df.count() - df.dropDuplicates().count()

Some values might be missing. We check the presence of null values for each column.

display(df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]))

Since some of our columns do contain null values, we'll have to fix those rows.

We could either replace null values using fillna or drop such rows using dropna:

df = df.fillna({'passengerCount': 1}).dropna()
display(df)

Explore Summary Statistics and Data Distribution

Predictive modeling is based on statistics and probability, so we should take a look at the summary statistics for the columns in our data. The describe function returns a dataframe containing the count, mean, standard deviation, minimum, and maximum values for each numeric column.

display(df.describe())

Visualizing data

Azure Databricks has custom support for displaying data.

The display(..) command has multiple capabilities:

  • Viewing up to 1000 records at a time.
  • Exporting data as CSV.
  • Rendering a multitude of different graphs.
  • Rendering geo-located data on a world map.

Let's take a look at our data using Databricks visualizations:

  • Run the cell below
  • Click on the second icon underneath the executed cell and choose Bar
  • Click on the Plot Options button to configure the graph
    • drag tripDistance into the Keys list
    • drag totalAmount into the Values list
    • choose AVG as the Aggregation
    • click Apply

dfClean = df.select(col("tripDistance"), col("totalAmount")).dropna()

display(dfClean)

Note that the points form a diagonal line, which indicates a strong linear relationship between the trip distance and the total amount. This linear relationship shows a correlation between these two values, which we can measure statistically.

The corr function calculates a correlation value between -1 and 1, indicating the strength of correlation between two fields. A strong positive correlation (near 1) indicates that high values for one column are often found with high values for the other, while a strong negative correlation (near -1) indicates that low values for one column are often found with high values for the other. A correlation near 0 indicates little apparent relationship between the fields.

dfClean.corr('tripDistance', 'totalAmount')

Predictive modeling is largely based on statistical relationships between fields in the data. To design a good model, you need to understand how the data points relate to one another.

A common way to start exploring relationships is to create visualizations that compare two or more data values. For example, modify the Plot Options of the chart above to compare the average snow depth recorded at each temperature, grouped by month:

  • Keys: temperature
  • Series Groupings: month_num
  • Values: snowDepth
  • Aggregation: AVG
  • Display Type: Line Chart

display(df)

The plot now shows the relation between the month, the snow amount and the recorded temperature.
