@prakashrd
Created March 16, 2019 13:02
PySpark: read two CSV files, join them on a column, and print the resulting DataFrame
import sys

from pyspark.sql import SparkSession

# Expect two CSV paths: the accounts file first, then the data file
if len(sys.argv) != 3:
    sys.exit("Usage: spark-submit <script> <accounts_csv> <data_csv>")
accounts_file, data_file = sys.argv[1], sys.argv[2]

spark = SparkSession.builder.appName("Python spark read two files").getOrCreate()

# Read both CSVs, taking column names from the header row and inferring column types
account_df = spark.read.csv(accounts_file, header=True, inferSchema=True)
data_df = spark.read.csv(data_file, header=True, inferSchema=True)

# Inner join (the default) on the shared "account_numbers" column
result_df = account_df.join(data_df, "account_numbers")
result_df.show()
result_df.printSchema()

spark.stop()

prakashrd commented Mar 16, 2019

spark-submit args.py /tmp/accounts.csv /data.csv
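
To try the command above without real data, here is a minimal sketch that writes two small CSV files sharing an `account_numbers` column. Only that column name comes from the gist; the helper `write_sample_csvs`, the `owner` and `balance` columns, and all values are made up for illustration, and the files go to a temporary directory rather than the paths in the command.

```python
import csv
import os
import tempfile


def write_sample_csvs(directory):
    """Write two tiny CSVs that share an account_numbers column; return their paths."""
    accounts_path = os.path.join(directory, "accounts.csv")
    data_path = os.path.join(directory, "data.csv")

    # Hypothetical columns and values; only account_numbers comes from the gist.
    accounts_rows = [
        {"account_numbers": 1001, "owner": "alice"},
        {"account_numbers": 1002, "owner": "bob"},
    ]
    data_rows = [
        {"account_numbers": 1001, "balance": 250.0},
        {"account_numbers": 1003, "balance": 75.5},
    ]

    for path, rows in ((accounts_path, accounts_rows), (data_path, data_rows)):
        with open(path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()  # the script reads with header=True
            writer.writerows(rows)

    return accounts_path, data_path


if __name__ == "__main__":
    accounts_path, data_path = write_sample_csvs(tempfile.mkdtemp())
    # Print a ready-to-run command pointing at the generated files
    print(f"spark-submit args.py {accounts_path} {data_path}")
```

Because the join in the script is an inner join, only account 1001 should appear in the result for this sample: 1002 and 1003 each occur in just one of the two files.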
