@LanternD
Last active May 31, 2019 23:51
Handy PySpark Functions

A list of useful PySpark functions that I have used; a learning note.

Remove rows with a NULL value (NULL corresponds to an empty field in the CSV)

non_empty_row_df = df.filter('your_column_name is not NULL')
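The filtering semantics can be illustrated in plain Python without a Spark session (hypothetical data; the column name is the placeholder from the snippet above):

```python
rows = [
    {"your_column_name": "alice"},
    {"your_column_name": None},  # NULL, i.e. an empty field in the CSV
    {"your_column_name": "bob"},
]

# Keep only rows whose column is not NULL/None
non_empty_rows = [r for r in rows if r["your_column_name"] is not None]
print(len(non_empty_rows))  # → 2
```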

Modify each row of the DataFrame

Using RDD map (not tested)

def customFunction(row):
    # Row objects are immutable, so build new values
    # instead of assigning to row fields
    name = row.name.strip()
    city = row.city + 'a'
    age = row.age + 1

    return (name, age, city)

# df.rdd.map() returns an RDD, which has no show();
# convert back to a DataFrame first
df2 = df.rdd.map(customFunction).toDF(['name', 'age', 'city'])
df2.show(5)
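The mapping logic itself can be checked without Spark by using a namedtuple as a stand-in for a Spark Row (both are immutable tuples; the field values here are hypothetical):

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row
Row = namedtuple("Row", ["name", "age", "city"])

def customFunction(row):
    # Return new values rather than mutating the row
    return (row.name.strip(), row.age + 1, row.city + "a")

result = customFunction(Row(name=" Ada ", age=30, city="Lans"))
print(result)  # → ('Ada', 31, 'Lansa')
```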

Using a UDF (user-defined function)

from pyspark.sql.functions import udf
from pyspark.sql.types import ShortType

def count_str_len(name_str):
    return len(name_str)

udf_count_str_len = udf(count_str_len, ShortType())

new_df = df.withColumn("keyword_length", udf_count_str_len(df['name']))
new_df.show(5)
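The function wrapped by the UDF is plain Python, so it can be sanity-checked on its own before registering it with Spark:

```python
def count_str_len(name_str):
    # Same function body as the one passed to udf()
    return len(name_str)

print(count_str_len("keyword"))  # → 7
```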

Drop a column

new_df = df.drop("your_column_name")
new_df.show(5)