example for "Prediction at Scale with scikit-learn and PySpark Pandas UDFs" (https://medium.com/civis-analytics/prediction-at-scale-with-scikit-learn-and-pyspark-pandas-udfs-51d5ebfb2cd8)

rmnka commented Nov 30, 2019

Hi, I'm getting some kind of PySpark error when running your code. I guess it's caused by a library incompatibility.
Could you please provide the Python, pandas, and PySpark versions that worked for you? Thanks.

mheilman (Author) commented Dec 2, 2019

Hi, sorry about not including version numbers in there. I added them just now.

I suspect the pandas or pyarrow version was causing trouble because I had to use some older versions of those to get this notebook to run just now. I didn't look into it deeply, but it looks like the latest version of pyspark at the moment (2.4.4) doesn't work well with the latest versions of pandas and pyarrow. This apparently has been addressed for future releases (see, e.g., apache/spark#24867).
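
For anyone comparing their environment against the versions listed in the notebook, here is a minimal sketch that prints what is installed. It assumes Python 3.8+ (for `importlib.metadata`), and the helper name `installed_versions` is hypothetical, not something from the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("pyspark", "pandas", "pyarrow")):
    """Return {package name: version string, or None if not installed}.

    Hypothetical helper for illustration; it only reads package metadata
    and does not import the packages themselves.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print one line per package so the output can be pasted into a comment.
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If the printed versions differ from the ones in the notebook, pinning pandas and pyarrow to older releases is the first thing I would try.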

rmnka commented Dec 2, 2019

Thank you so much for your quick response. It worked for me once I downgraded pyarrow to 0.8.0.

jamesonl commented Dec 3, 2019

Hi, this code is helpful for applying an already trained model at scale, but is it possible to train a model at scale using pandas UDFs?

Another way of asking the same question: is it possible to include the section called "Train a model with scikit-learn" within a pandas UDF?
