
@matthewseddon
Created March 22, 2017 10:58
Concatenate two DataFrames in PySpark
"""Concatenate two dataframes in pyspark.
run as spark-submit concat_pyspark.py --py-files
"""
from __future__ import print_function

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Set up a local Spark context and the Spark 1.x SQL entry point.
sc = SparkContext("local", "pyspark")
sqlContext = SQLContext(sc)
# January 2017 street-level crime CSVs for West Yorkshire and Wiltshire.
file_names = ['2017-01-{place}-street.csv'.format(place=p) for p in ('west-yorkshire', 'wiltshire')]
data_directory = 'police_data/2017-01/'

# Read each CSV with the spark-csv package (Spark 1.x), inferring the schema from the data.
wy_df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('{}{}'.format(data_directory, file_names[0]))
wi_df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('{}{}'.format(data_directory, file_names[1]))
# unionAll (renamed union in Spark 2.x) concatenates rows; columns are
# matched by position, so both inputs need the same column order.
unified_df = wy_df.unionAll(wi_df)
unified_df.show()
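
On Spark 2.x and later the same job can be written against a SparkSession, which replaces SQLContext and has a built-in CSV reader. A minimal sketch under that assumption, reusing the police_data paths above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('concat_pyspark').getOrCreate()

paths = ['police_data/2017-01/2017-01-{}-street.csv'.format(p)
         for p in ('west-yorkshire', 'wiltshire')]

# spark.read.csv accepts a list of paths, so both files land in one
# DataFrame without an explicit union; this assumes they share a schema.
unified_df = spark.read.csv(paths, header=True, inferSchema=True)
unified_df.show()

If the schemas can drift between files, read them separately and combine with unionByName (Spark 2.3+), which matches columns by name rather than position.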