Last active
October 31, 2017 01:14
from pyspark.ml.feature import QuantileDiscretizer

# Method of a class that also defines fix_data_type (not shown here).
def create_categorical_feature(self, dataframe, base_field, categorical_field, levels, increment=0):
    """Produces a PySpark dataframe containing a categorical field based on a specified field.

    :param dataframe: the PySpark dataframe
    :param base_field: the field that provides the values used to create the categorical field
    :param categorical_field: the name of the categorical field to be created
    :param levels: the number of levels (quantile buckets) to be created in the categorical field
    :param increment: the value to add to each level (Default value = 0)
    :returns: the PySpark dataframe containing the categorical field and all fields in the supplied dataframe
    """
    # QuantileDiscretizer needs a numeric input column, so cast the base field to double first.
    dataframe = self.fix_data_type(dataframe, [base_field], 'double')
    discretizer = QuantileDiscretizer(numBuckets=levels, inputCol=base_field, outputCol=categorical_field)
    dataframe = discretizer.fit(dataframe).transform(dataframe)
    # Bucket indices come back as doubles starting at 0.0; cast to int, then shift by increment.
    return dataframe.withColumn(categorical_field, dataframe[categorical_field].cast('int') + increment)
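
To see what the bucketing and `increment` do without starting a Spark session, here is a rough pure-Python sketch of the same idea. It is an illustration only: `quantile_levels` is a hypothetical helper that does not appear in the gist, and it computes exact quantiles with the standard library, whereas `QuantileDiscretizer` uses an approximate quantile algorithm, so split points on real data may differ.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_levels(values, levels, increment=0):
    """Assign each value a quantile bucket index in [0, levels), shifted by increment.

    Hypothetical stand-in for the Spark method above, for illustration only.
    """
    # Interior split points (levels - 1 of them). Spark estimates these
    # approximately; here they are computed exactly.
    splits = quantiles(values, n=levels, method="inclusive")
    # A value equal to a split point falls into the upper bucket,
    # matching the half-open [lower, upper) buckets Spark uses.
    return [bisect_right(splits, v) + increment for v in values]


ages = [18, 22, 25, 31, 40, 47, 53, 68]
print(quantile_levels(ages, levels=4, increment=1))  # [1, 1, 2, 2, 3, 3, 4, 4]
```

With `increment=1` the levels run from 1 to `levels` instead of 0 to `levels - 1`, which is handy when 0 is reserved for something else, such as a missing-value category.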