@samuelsmal
Created October 11, 2016 14:10
PySpark DataFrame filtering using a UDF and Regex
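import re

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def regex_filter(x):
    # Patterns to test; re.match anchors at the start of the string,
    # so the surrounding .* lets the text appear anywhere in the value.
    regexs = ['.*ALLYOURBASEBELONGTOUS.*']
    # Skip None, empty, and whitespace-only values.
    if x and x.strip():
        for r in regexs:
            if re.match(r, x, re.IGNORECASE):
                return True
    return False

# Wrap the Python predicate as a Spark UDF that returns a boolean column.
filter_udf = udf(regex_filter, BooleanType())

# Keep only the rows whose field matches at least one pattern.
df_filtered = df.filter(filter_udf(df.field_to_filter_on))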
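
The predicate above can be exercised in plain Python before it is wrapped in a UDF, which makes it easier to see which values pass. The following is a minimal sketch, not part of the original gist: the `PATTERNS` name, the precompiled patterns, and the sample strings are assumptions made for illustration.

```python
import re

# Hypothetical module-level pattern list; the gist defines a single
# pattern inside the function. Compiling once avoids repeated lookups.
PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in [r".*ALLYOURBASEBELONGTOUS.*"]
]


def regex_filter(x):
    """Return True if x is a non-blank string matching any pattern."""
    # None, empty, and whitespace-only values never match.
    if not x or not x.strip():
        return False
    return any(p.match(x) for p in PATTERNS)


# Illustrative inputs only; these values are not from the gist.
samples = ["xxAllYourBaseBelongToUsxx", "all your base", "   ", None]
results = [regex_filter(s) for s in samples]
print(results)
```

If the check is a single regular expression, Spark's built-in `Column.rlike` with an inline `(?i)` flag may be worth considering instead, since it avoids sending each row through a Python function. Its matching rules follow Java regular expressions rather than Python's `re`, so results can differ in edge cases.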
@j-greer
j-greer commented Dec 5, 2016

👍

@1danjordan
This was very useful, cheers!

@jordanbonilla
Thank you for sharing.

@aravindiiitb
This is very useful, thanks.

@radhikagfg
This was very useful, thanks.

@KwankiAhn
Cool, this is exactly what I was looking for. Thanks.

@tejabhat
Thank you, this was helpful, as it's difficult to find sample code for PySpark.

@jonasmodie
Thanks, simple and straightforward.

@danieljhegeman
👍 @huzefa
