@schaunwheeler
Last active March 12, 2019 19:29
Data science productionizaton: scale - example 1.py
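from pandas import DataFrame
from pyspark.sql import SparkSession, types as t, functions as f

# NOTE: normalize_word(word, stops) and STOPCHARS are not defined in this gist;
# define or import them before running this snippet.
sparkSession = SparkSession.builder.getOrCreate()  # reuse the active session or start one

# Build a small pandas DataFrame and convert it to a Spark DataFrame.
df = DataFrame({'ids': [1, 2, 3], 'words': ['abracadabra', 'hocuspocus', 'shazam']})
sdf = sparkSession.createDataFrame(df)
# Wrap the plain Python function as a Spark UDF that returns a string.
normalize_word_udf = f.udf(normalize_word, t.StringType())
# UDF arguments must be columns, so pass the stop characters as a literal array column.
stops = f.array([f.lit(c) for c in STOPCHARS])
results = sdf.select('ids', normalize_word_udf(f.col('words'), stops).alias('norms'))
results.show()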
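
The snippet above calls `normalize_word` and `STOPCHARS` without defining them. The following is a minimal sketch of what they might look like, assuming `normalize_word(word, stops)` simply removes every stop character from the word. Both the function body and the vowel list are hypothetical stand-ins, not the author's definitions. It is plain Python, so it can be checked without a Spark session. When the function is wrapped with `f.udf`, Spark delivers the array column to it as an ordinary Python list.

```python
# Hypothetical stand-ins: the original gist does not define these names.
STOPCHARS = ['a', 'e', 'i', 'o', 'u']


def normalize_word(word, stops):
    """Return `word` with every character listed in `stops` removed."""
    if word is None:
        # Spark passes nulls to Python UDFs as None; propagate them unchanged.
        return None
    stop_set = set(stops)  # set membership is faster than scanning a list
    return ''.join(ch for ch in word if ch not in stop_set)


# Same words as in the DataFrame above, processed without Spark.
words = ['abracadabra', 'hocuspocus', 'shazam']
norms = [normalize_word(w, STOPCHARS) for w in words]
print(norms)
```

Under these assumptions, the `norms` column produced by the Spark job would hold the same values as the `norms` list here.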