Skip to content

Instantly share code, notes, and snippets.

@dat-vikash
Forked from joshlk/faster_toPandas.py
Created February 27, 2018 12:52
Show Gist options
  • Save dat-vikash/e8f0bb9af99b20565c4fa7c22588fcc1 to your computer and use it in GitHub Desktop.
Save dat-vikash/e8f0bb9af99b20565c4fa7c22588fcc1 to your computer and use it in GitHub Desktop.
PySpark faster toPandas using mapPartitions
import pandas as pd
def _map_to_pandas(rdds):
""" Needs to be here due to pickling issues """
return [pd.DataFrame(list(rdds))]
def toPandas(df, n_partitions=None):
"""
Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
repartitioned if `n_partitions` is passed.
:param df: pyspark.sql.DataFrame
:param n_partitions: int or None
:return: pandas.DataFrame
"""
if n_partitions is not None: df = df.repartition(n_partitions)
df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
df_pand = pd.concat(df_pand)
df_pand.columns = df.columns
return df_pand
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment