Skip to content

Instantly share code, notes, and snippets.

@kingkastle
Forked from zaloogarcia/pandas_to_spark.py
Created January 28, 2020 13:47
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save kingkastle/9f1ea2f8ed75c3df30aa43203e04ab59 to your computer and use it in GitHub Desktop.
Save kingkastle/9f1ea2f8ed75c3df30aa43203e04ab59 to your computer and use it in GitHub Desktop.
Script for converting Pandas DF to Spark's DF
from pyspark.sql.types import *
# Auxiliar functions
# Pandas Types -> Sparks Types
def equivalent_type(f):
if f == 'datetime64[ns]': return DateType()
elif f == 'int64': return LongType()
elif f == 'int32': return IntegerType()
elif f == 'float64': return FloatType()
else: return StringType()
def define_structure(string, format_type):
try: typo = equivalent_type(format_type)
except: typo = StringType()
return StructField(string, typo)
#Given pandas dataframe, it will return a spark's dataframe
def pandas_to_spark(df_pandas):
columns = list(df_pandas.columns)
types = list(df_pandas.dtypes)
struct_list = []
for column, typo in zip(columns, types):
struct_list.append(define_structure(column, typo))
p_schema = StructType(struct_list)
return sqlContext.createDataFrame(df_pandas, p_schema)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment