PySpark DataFrame filtering using a UDF and Regex
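import re

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def regex_filter(x):
    # Patterns to test; re.match anchors at the start, hence the leading '.*'.
    regexs = ['.*ALLYOURBASEBELONGTOUS.*']
    # Skip None, empty, and whitespace-only values.
    if x and x.strip():
        for r in regexs:
            if re.match(r, x, re.IGNORECASE):
                return True
    return False

# Wrap the Python predicate as a Spark UDF that returns a boolean.
filter_udf = udf(regex_filter, BooleanType())
# df is assumed to be an existing DataFrame; keep rows where the UDF is True.
df_filtered = df.filter(filter_udf(df.field_to_filter_on))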
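
The predicate can be checked in plain Python before it is wrapped in a UDF. The following is a minimal sketch, not part of the original gist: the `PATTERNS` constant and the `samples` list are hypothetical names introduced for illustration, and the patterns are precompiled once on the assumption that this is preferable to passing the raw strings to `re.match` for every row.

```python
import re

# Hypothetical module-level constant: the gist's pattern list, compiled once.
PATTERNS = [re.compile(p, re.IGNORECASE) for p in ['.*ALLYOURBASEBELONGTOUS.*']]

def regex_filter(x):
    """Return True if x is a non-blank string matching any pattern."""
    # None, '' and whitespace-only values are rejected up front.
    if x and x.strip():
        return any(p.match(x) for p in PATTERNS)
    return False

# Illustrative inputs (assumed, not from the gist).
samples = [
    "xx AllYourBaseBelongToUs yy",   # token embedded, mixed case
    "ALLYOURBASEBELONGTOUS",         # exact token
    "all your base belong to us",    # spaces break the literal token
    "",                              # empty string
    "   ",                           # whitespace only
    None,                            # what a null column value arrives as
]
results = [regex_filter(s) for s in samples]
print(results)  # [True, True, False, False, False, False]
```

If the same match is all that is needed, Spark's built-in `Column.rlike` may be worth considering instead of a Python UDF, for example `df.filter(df.field_to_filter_on.rlike('(?i)ALLYOURBASEBELONGTOUS'))`. It uses Java regular expressions and matches anywhere in the string, so the surrounding `.*` is unnecessary there.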