Skip to content

Instantly share code, notes, and snippets.

@mr1azl
Forked from joshlk/faster_toPandas.py
Created April 13, 2016 22:13
Show Gist options
  • Save mr1azl/dcc92d56507fe874c785b47a23bae441 to your computer and use it in GitHub Desktop.
Save mr1azl/dcc92d56507fe874c785b47a23bae441 to your computer and use it in GitHub Desktop.
PySpark faster toPandas using mapPartitions
import pandas as pd
def _map_to_pandas(rdds):
""" Needs to be here due to pickling issues """
return [pd.DataFrame(list(rdds))]
def toPandas(df, n_partitions=None):
"""
Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
repartitioned if `n_partitions` is passed.
:param df: pyspark.sql.DataFrame
:param n_partitions: int or None
:return: pandas.DataFrame
"""
if n_partitions is not None: df = df.repartition(n_partitions)
df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
df_pand = pd.concat(df_pand)
df_pand.columns = df.columns
return df_pand
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment