@johnmuller87
johnmuller87 / comparison.csv
Created January 29, 2018 16:48
Comparison table
Type,Time (s)
Python UDF,43.0632779598
Python Vectorized UDF,13.9144539833
Scala UDF,0.257154205
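To put the timings above in relative terms, here is a small sketch that computes how many times faster one entry is than the others. The numbers are copied from the table; the helper name `speedup_over` is made up for illustration.

```python
# Timings in seconds, copied from the comparison table above.
TIMINGS = {
    "Python UDF": 43.0632779598,
    "Python Vectorized UDF": 13.9144539833,
    "Scala UDF": 0.257154205,
}


def speedup_over(baseline, timings):
    """Return, for every other entry, its time divided by the baseline's time."""
    base = timings[baseline]
    return {name: t / base for name, t in timings.items() if name != baseline}


for name, factor in speedup_over("Scala UDF", TIMINGS).items():
    print("Scala UDF is about {:.0f}x faster than {}".format(factor, name))
```

On these measurements the Scala UDF comes out roughly 167x faster than the plain Python UDF and roughly 54x faster than the vectorized one.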
johnmuller87 / setup.sh
Created January 29, 2018 10:23
Create your jar and pass it to PySpark
# Build the jar with SBT (the `assembly` task is provided by the sbt-assembly plugin)
sbt clean assembly
# Pass the jar to the PySpark session
pyspark --jars [path/to/jar/x.jar]
johnmuller87 / pyspark_example.py
Last active February 1, 2018 13:23
Using a Scala UDF in PySpark
# Register the Scala UDF with ONE of the three variants below, matching your Spark version.
# Spark < 2.1: use the tag 'pre-2.1'
spark._jvm.com.ing.wbaa.spark.udf.ValidateIBAN.registerUDF(spark._jsparkSession)
# Spark 2.1 and 2.2: use the tag '2.1+'
from pyspark.sql.types import BooleanType
sqlContext.registerJavaFunction("validate_iban", "com.ing.wbaa.spark.udf.ValidateIBAN", BooleanType())
# Spark 2.3+: use the tag '2.1+'
from pyspark.sql.types import BooleanType
spark.udf.registerJavaFunction("validate_iban", "com.ing.wbaa.spark.udf.ValidateIBAN", BooleanType())
# Use your UDF!
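Once registered, the UDF can be called by name from any SQL expression. The sketch below is one way to do that, assuming the function was registered as `validate_iban` (as in the Spark 2.1+ variants); the helper `with_iban_validity`, the column name `iban` and the `accounts` DataFrame are hypothetical and not part of the original gist.

```python
def with_iban_validity(df, column="iban", udf_name="validate_iban"):
    """Return `df` plus a boolean column computed by the registered Scala UDF.

    `df` is expected to be a pyspark.sql.DataFrame. `udf_name` must match the
    name used when registering the UDF (an assumption here).
    """
    # selectExpr takes SQL expression strings, so the JVM UDF is called by name.
    expression = "{}({}) AS {}_is_valid".format(udf_name, column, column)
    return df.selectExpr("*", expression)


# Hypothetical usage inside a PySpark session started with --jars:
# accounts = spark.createDataFrame([("NL91ABNA0417164300",), (None,)], ["iban"])
# with_iban_validity(accounts).show()
```

Because the expression is evaluated inside the JVM, rows do not have to be serialized to a Python worker and back. That round trip is likely what the slower Python rows in the comparison table are paying for.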
johnmuller87 / ExampleUDF.scala
Last active February 1, 2018 13:43
Example UDF
package com.ing.wbaa.spark.udf
import org.apache.spark.sql.api.java.UDF1
import org.iban4j._
import scala.util.Try
/** Validates an IBAN after removing whitespace. If the IBAN is valid, IbanUtil throws no exception and true is returned.
* If it is invalid, an exception is thrown and false is returned. A null input also returns false.
*/
class ValidateIBAN extends UDF1[String, Boolean] {