Saves a Spark DataFrame into a single CSV/delimited file efficiently. Assumes the file storage is HDFS.
import subprocess


def write_to_local_fs(df):
    """
    Write a Spark DataFrame to the local filesystem efficiently, without
    using coalesce or repartition.

    The idea is to persist the data in cluster format in HDFS (or whatever
    distributed file storage you use), then merge the part files down to
    the local file system. The header must be written to the local file
    separately, since the part files are written without one.

    :param df: the DataFrame to write
    """
    hdfs_dir = "/path/to/some/valid_writeable/hdfs/directory"
    local_file = "csv_output.csv"
    df.write.mode("overwrite").format("com.databricks.spark.csv")\
        .options(header="false", delimiter=",")\
        .save(hdfs_dir)
    # ...and whatever other options you need
    # Merge the part files into a single local file (still has no header)
    subprocess.call(["hdfs", "dfs", "-getmerge", hdfs_dir, local_file])
    # Your logic to write header into the csv file on local_file
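The header step above is left to the reader. One way to do it (a minimal sketch; the `prepend_header` helper and its parameters are assumptions for illustration, not part of the gist) is to write the header line into a temp file, stream the merged data after it, and swap the files:

```python
import os
import shutil
import tempfile


def prepend_header(local_file, columns, delimiter=","):
    """Prepend a header line to an existing delimited file in place.

    Hypothetical helper: writes header + original contents to a temp
    file, then replaces the original. `columns` would typically come
    from df.columns on the DataFrame you just saved.
    """
    header = delimiter.join(columns) + "\n"
    fd, tmp_path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as tmp, open(local_file, "r") as src:
        tmp.write(header)                 # header line first
        shutil.copyfileobj(src, tmp)      # then the merged, header-less data
    shutil.move(tmp_path, local_file)     # atomic-ish replace on most setups
```

Streaming with `copyfileobj` avoids loading the whole merged file into memory, which matters because the entire point of the `-getmerge` approach is that the output may be large.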