Vincent Gromakowski vgkowski

  • Amazon Web Services
  • Lyon, France
"""
AWS Glue Streaming Job: MSK → Iceberg with lag monitoring.

Reads from MSK via Spark Structured Streaming, writes to an Iceberg table
in the Glue Data Catalog, and tracks consumer lag via a dedicated Kafka
consumer group using the LagMonitorListener.

Usage:
    Upload this file + streaming_lag_monitor.py to S3 and reference them
    in your Glue job configuration (see infra/main.tf).
vgkowski / spark-upsert.scala
Created January 6, 2020 20:07
Spark code for upsert
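import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit}

// `spark` (SparkSession) and `target` (base path) are assumed to be defined by the caller.
// Existing rows, flagged last = false.
val oldData = spark.read.parquet(target + "/old")
  .withColumn("last", lit(false))
oldData.cache()

// Incoming rows, flagged last = true and re-ordered to match oldData's columns.
val newData = spark.read.parquet(target + "/updates")
  .withColumn("last", lit(true))
  .select(oldData.columns.map(col): _*)
newData.cache()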
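
The snippet above stops after loading and flagging the two datasets. One plausible way to finish the upsert, given the `Window` import and the `last` flag, is to union both DataFrames and keep a single row per key, preferring the row flagged `last = true`. The following is only a sketch under assumptions: the helper name `upsert` and the key column passed as `keyCol` are hypothetical, since the original does not name a key, and this may not be how the author completed the gist.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical helper (not part of the original gist).
// `keyCol` is an assumed primary-key column present in both DataFrames.
def upsert(oldData: DataFrame, newData: DataFrame, keyCol: String): DataFrame = {
  // Within each key, sort rows so that updates (last = true) come first.
  val byKey = Window.partitionBy(col(keyCol)).orderBy(col("last").desc)

  oldData
    .union(newData) // positional union: both sides must share the same column order
    .withColumn("rank", row_number().over(byKey))
    .filter(col("rank") === 1) // keep the preferred row for each key
    .drop("rank", "last")      // remove the helper columns
}
```

Calling `upsert(oldData, newData, "id")` would return every old row whose key has no update, plus every updated or new row. If the updates themselves contain several rows for the same key, which one survives is arbitrary in this sketch; a timestamp column in the `orderBy` would be needed to break such ties.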