@tomron
Created November 17, 2016 10:53
Converts a Parquet file to JSON using Spark
# import Spark, set Spark context
from pyspark import SparkContext, SparkConf
from pyspark.sql.context import SQLContext
import sys
import os

if len(sys.argv) == 1:
    sys.stderr.write("Must enter input file to convert\n")
    sys.exit(1)

input_file = sys.argv[1]
if len(sys.argv) >= 3:
    # output directory given: <output_dir>/<input file name without extension>
    output_path = os.path.join(
        sys.argv[2], os.path.basename(input_file).split(".", 1)[0])
else:
    output_path = "to_json_" + input_file.split(".", 1)[0]

conf = SparkConf().setAppName(
    "parquet_to_json_{f}".format(f=input_file.split(".", 1)[0]))
sc = SparkContext(conf=conf)
# set SQL context
sqlContext = SQLContext(sc)

# read the Parquet file and write it back out as JSON
parquetFile = sqlContext.read.parquet(input_file)
parquetFile.write.json(output_path)
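The output-path logic above can be exercised on its own, without Spark. A minimal sketch (the function name `derive_output_path` is hypothetical, not part of the gist):

```python
import os

def derive_output_path(input_file, output_dir=None):
    # Mirror the script's logic: take the file name up to the first dot
    stem = os.path.basename(input_file).split(".", 1)[0]
    if output_dir is not None:
        # Explicit output directory: <output_dir>/<stem>
        return os.path.join(output_dir, stem)
    # No output directory: prefix the input path (not its basename),
    # which is what the script above does
    return "to_json_" + input_file.split(".", 1)[0]

print(derive_output_path("events.parquet"))         # to_json_events
print(derive_output_path("events.parquet", "out"))
```

Note that the no-directory branch keeps any leading path components of the input, so an input like `data/events.parquet` yields `to_json_data/events`.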
@goelnishank9

Can you please provide a sample file along with this code?

@jcopps

jcopps commented Jul 12, 2021

Thanks @tomron. This is really helpful.

@CoolOppo

# simpler alternative using pandas instead of Spark
import argparse
import pandas as pd

parser = argparse.ArgumentParser(description="Convert Parquet to JSON")
parser.add_argument("input", help="Path to the input Parquet file")
parser.add_argument("output", help="Path to the output JSON file")

args = parser.parse_args()

# read_parquet needs a Parquet engine installed (pyarrow or fastparquet)
df = pd.read_parquet(args.input)
df.to_json(args.output)
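One caveat with the pandas version: Spark's `write.json` produces JSON Lines (one object per line), while pandas' `to_json` defaults to a single column-oriented object. To get line-delimited records like Spark's output, pass `orient="records", lines=True`. A small sketch with an in-memory frame standing in for the Parquet data:

```python
import pandas as pd

# stand-in for a frame loaded via pd.read_parquet
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# orient="records" emits one JSON object per row;
# lines=True joins them with newlines (JSON Lines), matching Spark's format
out = df.to_json(orient="records", lines=True)
print(out)
```

The same keyword arguments work when writing to a path, e.g. `df.to_json(args.output, orient="records", lines=True)`.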
