@gcsfred
Created November 15, 2018 12:55
Define a pandas_udf-annotated function that vectorizes a column of text from a DataFrame.
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
import spacy
#...
# nlp = spacy.load('en_core_web_lg')
nlp = spacy.load('en_core_web_sm')
#...
# Use pandas_udf to define a Pandas UDF
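# The input is a pandas.Series with strings. The output is a pandas.Series of arrays of double.
@pandas_udf('array<double>', PandasUDFType.SCALAR)
def pandas_nlp(s):
    # Replace nulls and empty strings with placeholder tokens so that nlp() always receives a non-empty string,
    # then convert each text to its spaCy document vector as a plain Python list.
    return s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_').transform(lambda x: (nlp(x).vector.tolist()))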
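
The Series-in, Series-out contract of the function above can be tried without Spark or a spaCy model. The following is only a minimal sketch under stated assumptions: `toy_vector` is a hypothetical stand-in for `nlp(x).vector.tolist()`, `vectorize_series` is an illustrative name, and the ASCII placeholders are simplified versions of the ones in the gist.

```python
import pandas as pd


def toy_vector(text):
    # Hypothetical stand-in for nlp(text).vector.tolist(); this is not spaCy.
    # It returns [character count, whitespace count] as floats.
    return [float(len(text)), float(sum(ch.isspace() for ch in text))]


def vectorize_series(s, vectorize=toy_vector):
    # Same shape of logic as pandas_nlp: fill nulls, replace empty strings,
    # then map every string to a list of doubles.
    cleaned = s.fillna("_NO_NA_").replace("", "_NO_EMPTY_")
    return cleaned.apply(vectorize)


sample = pd.Series(["hello world", None, ""])
result = vectorize_series(sample)
print(result.tolist())
```

The output has one list per input row, including the null and empty rows, which is what a scalar Pandas UDF requires: the returned Series must have the same length as the input Series.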