A list of useful PySpark functions that I have used. A learning note.

## Remove rows with a NULL value (equivalent to an empty string in CSV)

Link

```python
non_empty_row_df = df.filter('your_column_name is not NULL')
```

## Modify each row of the DataFrame

### Using RDD map (not tested)

Link

```python
def customFunction(row):
    # Row objects are immutable, so build new values instead of assigning to the row
    name = row.name.strip()
    city = row.city + 'a'
    age = row.age + 1
    return (name, age, city)

# map() returns an RDD, so convert back to a DataFrame before calling show()
df2 = df.rdd.map(customFunction).toDF(['name', 'age', 'city'])
df2.show(5)
```

### Using a UDF (user-defined function)

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import ShortType

def count_str_len(name_str):
    return len(name_str)

udf_count_str_len = udf(count_str_len, ShortType())
new_df = df.withColumn("keyword_length", udf_count_str_len(df['name']))
new_df.show(5)
```

## Drop a column

```python
new_df = df.drop("your_column_name")
new_df.show(5)
```
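
## Putting it together

For reference, here is a minimal end-to-end sketch that strings the snippets above together on a small DataFrame. The column names (`name`, `city`, `age`) and sample rows are invented for illustration; the actual data and schema will differ.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ShortType

spark = SparkSession.builder.appName("pyspark-notes-demo").getOrCreate()

# Hypothetical sample data; one row has a NULL city to demonstrate the filter
df = spark.createDataFrame(
    [(" Alice ", "Oslo", 30), ("Bob", None, 25)],
    ["name", "city", "age"],
)

# Keep only rows where `city` is not NULL
non_empty_row_df = df.filter('city is not NULL')

# Add a column with the length of `name` via a UDF
count_str_len = udf(lambda s: len(s), ShortType())
with_len_df = non_empty_row_df.withColumn(
    "keyword_length", count_str_len(non_empty_row_df['name'])
)

# Drop a column we no longer need
final_df = with_len_df.drop("age")
final_df.show()
```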