How I Built a Data Lakehouse with Delta Lake Architecture
# Install dependencies first (shell commands, not Python):
#   pip install delta-spark==2.4.0
#   pip install pyspark
# (delta-spark 2.4.0 targets PySpark 3.4.x)
import pyspark
from pyspark.sql import Row
from delta import configure_spark_with_delta_pip

# Build a Spark session with the Delta Lake SQL extension and catalog enabled
builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# configure_spark_with_delta_pip attaches the pip-installed Delta Lake JARs
spark = configure_spark_with_delta_pip(builder).getOrCreate()
# Create an RDD of Rows with ID and Amount
rdd = spark.sparkContext.parallelize([
    Row(id=1, amount=100),
    Row(id=2, amount=200),
    Row(id=3, amount=300),
    Row(id=4, amount=400),
    Row(id=5, amount=500)
])
# Create a DataFrame from the RDD
df = spark.createDataFrame(rdd)
df.show()
# Write the DataFrame to a Delta table
delta_table_path = "/path/to/delta-table"
df.write.format("delta").mode("overwrite").save(delta_table_path)
# Read from delta table
df_read = spark.read.format("delta").load(delta_table_path)
df_read.show()
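# --- Optional: time travel back to an earlier table version ---
# A minimal sketch, not part of the original gist: version 0 assumes the
# overwrite above was this table's first commit.
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
df_v0.show()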
# Stop the Spark session
spark.stop()