@justinTM
Created March 2, 2022 03:57
Apache Spark RDD parallelize pipe JSON file through jq multi-core
import os
from pyspark.sql import SparkSession
# directory containing the JSON files to compact
DIR_JSONS = '/tmp/in/jsons'

# create the Spark session on a cluster with multiple cores
SPARK = SparkSession.builder.appName('APP_NAME').getOrCreate()
sc = SPARK.sparkContext
# compact a JSON file in place by piping it through jq.
# write to a temp file first: redirecting straight back to the
# input file would truncate it before jq can read it
def os_shell_jq(filepath):
    os.system(f"jq -c '.' '{filepath}' > '{filepath}.tmp' "
              f"&& mv '{filepath}.tmp' '{filepath}'")
# get all JSON filepaths in a directory
jsons = [os.path.join(DIR_JSONS, f) for f in os.listdir(DIR_JSONS)]
# run the shell command for each JSON filepath, distributed across
# the cluster's cores; foreach() is an action and returns None
sc.parallelize(jsons).foreach(os_shell_jq)
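
The same compaction can be tested locally without Spark or jq installed. Below is a minimal sketch of the idea using only the Python standard library: `json.dump` with tight separators stands in for `jq -c '.'`, and `multiprocessing.Pool` stands in for the RDD's parallelism. The function and directory names here are illustrative, not part of the gist above.

```python
import json
import os
import tempfile
from multiprocessing import Pool


def compact_json(filepath):
    """Rewrite a JSON file with no extra whitespace (like `jq -c '.'`)."""
    with open(filepath) as f:
        data = json.load(f)
    # read fully before reopening for write, so the file is not truncated early
    with open(filepath, 'w') as f:
        json.dump(data, f, separators=(',', ':'))


if __name__ == '__main__':
    # build a throwaway directory with one pretty-printed JSON file
    dir_jsons = tempfile.mkdtemp()
    path = os.path.join(dir_jsons, 'a.json')
    with open(path, 'w') as f:
        f.write('{\n  "k": [1, 2]\n}\n')

    files = [os.path.join(dir_jsons, f) for f in os.listdir(dir_jsons)]
    # one worker process per core, analogous to the RDD partitions above
    with Pool() as pool:
        pool.map(compact_json, files)

    print(open(path).read())  # prints {"k":[1,2]}
```

This keeps the per-file work in pure Python, which is handy for unit tests; the Spark version above is preferable when the files live on a cluster and jq's streaming speed matters.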