@samuelsmal
Created October 11, 2016 14:10
PySpark DataFrame filtering using a UDF and Regex
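import re

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def regex_filter(x):
    # Patterns to look for; each is matched case-insensitively.
    regexs = ['.*ALLYOURBASEBELONGTOUS.*']
    if x and x.strip():  # skip None, empty, and whitespace-only values
        for r in regexs:
            if re.match(r, x, re.IGNORECASE):
                return True
    return False

# Wrap the predicate as a UDF that returns a boolean column.
filter_udf = udf(regex_filter, BooleanType())
# df is an existing DataFrame; keep only the rows whose field matches.
df_filtered = df.filter(filter_udf(df.field_to_filter_on))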
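The predicate itself is plain Python, so it can be checked without a Spark session before it is wrapped in `udf`. The following is a minimal sketch of one possible variant, not the gist author's code: it precompiles the patterns once and uses `re.search` instead of `re.match` with a leading `.*`. The name `PATTERNS` and the sample strings are assumptions made for illustration.

```python
import re

# Hypothetical module-level pattern list; only the pattern text comes from the gist.
# Compiling once avoids relying on the re module's cache for every row.
PATTERNS = [re.compile(p, re.IGNORECASE) for p in ["ALLYOURBASEBELONGTOUS"]]


def regex_filter(x):
    """Return True if x is a non-blank string containing any of the patterns."""
    # None, empty, and whitespace-only values never match.
    if not x or not x.strip():
        return False
    # search() looks for the pattern anywhere in the value,
    # so no leading or trailing ".*" is needed.
    return any(p.search(x) for p in PATTERNS)


# Quick local checks with made-up sample values.
print(regex_filter("xx allyourbasebelongtous yy"))
print(regex_filter("   "))
print(regex_filter(None))
```

One behavioural difference to be aware of: `re.search` also finds the pattern on later lines of a multi-line value, whereas `re.match` with `.*` stops at the first newline. If the patterns are fixed, Spark's built-in `Column.rlike` may be worth trying in place of a Python UDF, for example `df.filter(df.field_to_filter_on.rlike('(?i)ALLYOURBASEBELONGTOUS'))`; it takes Java regex syntax, so patterns may need adjusting.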
@danieljhegeman

👍 @huzefa