@LanternD
Last active May 31, 2019 23:51
Handy PySpark Functions

A list of useful PySpark functions that I have used; a learning note.

Remove rows with a NULL value (NULL corresponds to an empty field in the CSV)

non_empty_row_df = df.filter('your_column_name is not NULL')
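The filtering semantics can be illustrated in plain Python without a Spark session (hypothetical data; the column name is the placeholder from the snippet above):

```python
rows = [
    {"your_column_name": "alice"},
    {"your_column_name": None},  # NULL, i.e. an empty field in the CSV
    {"your_column_name": "bob"},
]

# Keep only rows whose column is not NULL/None
non_empty_rows = [r for r in rows if r["your_column_name"] is not None]
print(len(non_empty_rows))  # → 2
```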

Modify each row of the DataFrame

Using RDD map (not tested)

def customFunction(row):
    # Row objects are immutable, so build new values
    # instead of assigning to row fields
    name = row.name.strip()
    city = row.city + 'a'
    age = row.age + 1

    return (name, age, city)

# df.rdd.map() returns an RDD, which has no show();
# convert back to a DataFrame first
df2 = df.rdd.map(customFunction).toDF(['name', 'age', 'city'])
df2.show(5)
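The mapping logic itself can be checked without Spark by using a namedtuple as a stand-in for a Spark Row (both are immutable tuples; the field values here are hypothetical):

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row
Row = namedtuple("Row", ["name", "age", "city"])

def customFunction(row):
    # Return new values rather than mutating the row
    return (row.name.strip(), row.age + 1, row.city + "a")

result = customFunction(Row(name=" Ada ", age=30, city="Lans"))
print(result)  # → ('Ada', 31, 'Lansa')
```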

Using a UDF (user-defined function)

from pyspark.sql.functions import udf
from pyspark.sql.types import ShortType

def count_str_len(name_str):
    return len(name_str)

udf_count_str_len = udf(count_str_len, ShortType())

new_df = df.withColumn("keyword_length", udf_count_str_len(df['name']))
new_df.show(5)
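The function wrapped by the UDF is plain Python, so it can be sanity-checked on its own before registering it with Spark:

```python
def count_str_len(name_str):
    # Same function body as the one passed to udf()
    return len(name_str)

print(count_str_len("keyword"))  # → 7
```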

Drop a column

new_df = df.drop("your_column_name")
new_df.show(5)