@marcolivierarsenault
Created April 12, 2019 01:47
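from utils.scala_functions import find_matching_patterns
from pyspark.sql import functions as F

# `regex` and `articles` are assumed to be existing DataFrames, `sc` the active SparkContext.
# Gather every regex pattern into one Python list on the driver, then broadcast it.
regexes = regex.agg(F.collect_list(F.col("pattern"))).collect()[0][0]
regexes = sc.broadcast(regexes)

# Tag each article with its matching patterns. A null result becomes a one-element
# array holding null so that explode() still emits a row for articles with no match.
articles = articles \
    .withColumn("patterns", find_matching_patterns(F.col("text"), regexes.value)) \
    .withColumn("patterns", F.when(F.col("patterns").isNull(), F.array(F.lit(None))).otherwise(F.col("patterns"))) \
    .withColumn("pattern", F.explode(F.col("patterns")))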