Created
March 8, 2018 00:31
Saves a spark dataframe into a single csv/delimited file efficiently. Assumes the file storage to be hdfs
import subprocess

def write_to_local_fs(df):
    """
    Write a dataframe to the local filesystem efficiently, without using
    coalesce or repartition. The idea is to persist the data in its
    distributed form in HDFS (or whatever file storage is in use) and then
    merge the part files into a single file on the local filesystem.
    The header must be written to the local file separately.

    :param df: the dataframe to write
    """
    hdfs_dir = "/path/to/some/valid_writeable/hdfs/directory"
    local_file = "csv_output.csv"
    df.write.mode("overwrite").format("com.databricks.spark.csv")\
        .options(header="false", delimiter=",")\
        .save(hdfs_dir)
    # and whatever other options you need
    # The merged file does not have a header
    subprocess.call(["hdfs", "dfs", "-getmerge", hdfs_dir, local_file])
    # Your logic to write the header into the csv file at local_file
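The header-writing step is left as an exercise above, because `hdfs dfs -getmerge` simply concatenates the part files and none of them carries column names when the data is written with `header="false"`. A minimal sketch of that step might look like the following; the helper name `prepend_header` and the temporary `.tmp` path are hypothetical, and it assumes the dataframe exposes its column names via `df.columns` (as Spark dataframes do):

```python
import shutil

def prepend_header(df, local_file):
    # Build the CSV header line from the dataframe's column names.
    header = ",".join(df.columns)
    # Write the header, then stream the merged (header-less) file after it.
    # A temp file avoids loading the whole merged output into memory.
    tmp_file = local_file + ".tmp"  # hypothetical temporary path
    with open(local_file) as src, open(tmp_file, "w") as dst:
        dst.write(header + "\n")
        shutil.copyfileobj(src, dst)
    # Replace the original file with the header-prefixed version.
    shutil.move(tmp_file, local_file)
```

Call it right after the `getmerge`, e.g. `prepend_header(df, local_file)`. Streaming with `shutil.copyfileobj` keeps memory use constant even for large merged files.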