@kovid-r
Last active October 11, 2022 04:49
Create new columns in existing table PySpark Cheatsheet
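from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Create a column with the default value 'xyz'
df = df.withColumn('new_column', F.lit('xyz'))
# Create a column whose default value is null (cast so the column has a concrete type)
df = df.withColumn('new_column', F.lit(None).cast(StringType()))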
# Create a column using an existing column
df = df.withColumn('new_column', 1.4 * F.col('existing_column'))
# Another example using the MovieLens dataset
df = df.withColumn('test_col3', F.when(F.col('avg_ratings') < 7, 'OK')
                                 .when(F.col('avg_ratings') < 8, 'Good')
                                 .otherwise('Great'))
df.show()  # show() returns None, so do not assign its result back to df
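# Create a column using a UDF
def categorize(val):
    if val is None:  # nulls arrive as None; comparing None < 150 raises TypeError
        return None
    if val < 150:
        return 'bucket_1'
    else:
        return 'bucket_2'

# Wrap the Python function as a UDF and apply the UDF (not the raw function) to the column
my_udf = F.udf(categorize, StringType())
df = df.withColumn('new_column', my_udf(F.col('existing_column')))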
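
The function behind the UDF is ordinary Python, so its bucketing logic can be checked on plain values before it is run against a DataFrame. A minimal sketch, assuming nulls should pass through as `None` (that guard is an assumption about the desired behaviour) and using made-up sample values:

```python
def categorize(val):
    """Bucket a numeric value; mirrors the UDF body in the cheatsheet."""
    # Assumption: null inputs should stay null rather than raise TypeError.
    if val is None:
        return None
    if val < 150:
        return 'bucket_1'
    return 'bucket_2'


# Hypothetical sample values, chosen to exercise the boundary at 150.
samples = [149, 150, None]
results = [categorize(v) for v in samples]
print(results)
```

Checking the function this way needs no Spark session, which makes boundary mistakes (for example `<` versus `<=`) quick to catch.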