Pandas DataFrame --> Apache Spark DataFrame
Syntax comparison between Pandas and Apache Spark DataFrames
Creating DataFrames:
Pandas:
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Spark:
spark.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
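As a minimal sketch (the column names and values are made up for illustration, and spark is assumed to be an existing SparkSession), the same two-column table can be built both ways:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pandas: build from a dict of columns
pdf = pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})

# Spark: build from a list of tuples plus a list of column names as the schema
sdf = spark.createDataFrame([('a', 1), ('b', 2)], schema=['name', 'value'])

# An existing Pandas DataFrame can also be handed to createDataFrame directly
sdf_from_pandas = spark.createDataFrame(pdf)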
Aggregation and Grouping:
Pandas:
df.groupby('col').agg({'col2': 'mean', 'col3': 'sum'})
df.groupby('col').col2.mean()
Spark:
df.groupBy('col').agg({'col2': 'avg', 'col3': 'sum'})
df.groupBy('col').avg('col2')
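A sketch of the same aggregation written with Spark's column-expression API, which also lets you alias the output columns (df_sp is a hypothetical Spark DataFrame with columns col, col2 and col3):
from pyspark.sql import functions as F

# Equivalent to the dict form above, but with explicit output column names
df_sp.groupBy('col').agg(
    F.avg('col2').alias('col2_mean'),
    F.sum('col3').alias('col3_sum')
).show()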
Filtering and Selection:
Pandas:
df[df['col'] > 10]
df.loc[df['col'] > 10, 'col2']
Spark:
df.filter(df['col'] > 10)
df.select('col2').where(df['col'] > 10)
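Compound conditions look similar in both libraries; each term needs its own parentheses. A sketch with hypothetical DataFrames df_pd (Pandas) and df_sp (Spark):
# Pandas: combine boolean masks with & / |
subset_pd = df_pd.loc[(df_pd['col'] > 10) & (df_pd['col2'] == 'x'), ['col2', 'col3']]

# Spark: same operators on Column expressions
subset_sp = df_sp.filter((df_sp['col'] > 10) & (df_sp['col2'] == 'x')).select('col2', 'col3')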
Sorting:
Pandas:
df.sort_values(by='col')
Spark:
df.orderBy('col')
Joining and Merging:
Pandas:
pd.merge(df1, df2, on='col'): Performs an inner join with another DataFrame based on a common column.
pd.concat([df1, df2], axis=0): Concatenates DataFrames vertically (along rows).
df1.append(df2): Appends rows from df2 to df1 (deprecated since pandas 1.4 and removed in 2.0; use pd.concat instead).
Spark:
df1.join(other_df, on='col', how='inner'): Joins two DataFrames based on a common column (inner join by default).
df1.union(other_df): Concatenates DataFrames vertically (union of rows).
df1.unionByName(other_df): Union of DataFrames by column names (columns must match).
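A short sketch of vertical concatenation in both APIs, assuming df1_pd/df2_pd and df1_sp/df2_sp are hypothetical DataFrames with identical columns:
import pandas as pd

# Pandas: stack rows; ignore_index rebuilds the default integer index
combined_pd = pd.concat([df1_pd, df2_pd], axis=0, ignore_index=True)

# Spark: union matches columns by position, unionByName matches by name
combined_sp = df1_sp.union(df2_sp)
combined_by_name = df1_sp.unionByName(df2_sp)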
Shape (Dimensions):
Pandas:
df.shape: Returns a tuple representing the dimensions (rows, columns) of the DataFrame.
Spark:
To get the number of rows: df.count()
To get the number of columns: len(df.columns)
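There is no single shape attribute in Spark, but the pair can be assembled; note that count() runs a job over the data while columns is just metadata. A sketch with a hypothetical df_sp:
n_rows = df_sp.count()        # triggers a full computation
n_cols = len(df_sp.columns)   # metadata only, no computation
shape = (n_rows, n_cols)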
Column Names:
Pandas:
df.columns: Returns a list of column names.
Spark:
df.columns
Index (Row Labels):
Pandas:
df.index: Returns the index (row labels) of the DataFrame.
Spark:
Spark DataFrames do not have an explicit index like Pandas, and rows have no inherent order. If a row identifier is needed, one can be added with e.g. pyspark.sql.functions.monotonically_increasing_id().
Head and Tail:
Pandas:
df.head(n=5): Returns the first n rows of the DataFrame.
df.tail(n=5): Returns the last n rows of the DataFrame.
Spark:
To get the first n rows: df.limit(n).show() (or simply df.show(n))
To get the last n rows: there is no "last" without an ordering; sort descending and take the first n, e.g. df.orderBy(df['some_column'].desc()).limit(n).show(), or use df.tail(n) (Spark 3.0+), which returns a list of Row objects.
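A sketch of both directions, with an illustrative sort column name and a hypothetical df_sp:
from pyspark.sql import functions as F

# First 5 rows (order is arbitrary unless the DataFrame is sorted)
df_sp.limit(5).show()

# "Last" 5 rows relative to some_column: sort descending, then take the first 5
df_sp.orderBy(F.col('some_column').desc()).limit(5).show()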
Data Types:
Pandas:
df.dtypes: Returns the data types of each column.
Spark:
df.dtypes
Summary Statistics:
Pandas:
df.describe(): Generates summary statistics (count, mean, std, min, max, etc.) for numeric columns.
Spark:
df.summary().show()
Sample:
Pandas:
df.sample(n=5): Returns a random sample of n rows from the DataFrame.
Spark:
df.sample(False, fraction=0.1).show()
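Note the difference in semantics: Pandas sample takes an exact row count, while Spark sample takes a fraction, so the Spark result size is only approximate. A sketch with hypothetical df_pd and df_sp:
# Pandas: exactly 5 random rows; random_state makes it repeatable
sample_pd = df_pd.sample(n=5, random_state=42)

# Spark: roughly 10% of rows, without replacement; seed makes it repeatable
sample_sp = df_sp.sample(withReplacement=False, fraction=0.1, seed=42)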
Drop Columns:
Pandas:
df.drop(columns=['col1', 'col2']): Removes specified columns from the DataFrame.
Spark:
df.drop("col1", "col2")
Rename Columns:
Pandas:
df.rename(columns={'old_col': 'new_col'}): Renames columns.
Spark:
df.withColumnRenamed("old_col", "new_col")
Set Index:
Pandas:
df.set_index('col'): Sets a column as the index.
Spark:
Spark DataFrames do not have an explicit index. Use other methods for row identification.
Reset Index:
Pandas:
df.reset_index(): Resets the index to default integer index.
Spark:
Spark DataFrames do not have an explicit index. No direct equivalent.
Sort Values:
Pandas:
df.sort_values(by='col'): Sorts the DataFrame by a specified column.
Spark:
df.orderBy("col").show()
Missing Values:
Pandas:
df.isnull(): Returns a DataFrame of Boolean values indicating missing (NaN) values.
df.notnull(): Returns a DataFrame of Boolean values indicating non-missing values.
Spark:
from pyspark.sql.functions import col
df.select([col(c).isNull().alias(c) for c in df.columns]).show()
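A common variant is counting missing values per column rather than flagging them row by row; a sketch with hypothetical df_pd and df_sp:
from pyspark.sql import functions as F

# Pandas: missing-value count per column
null_counts_pd = df_pd.isnull().sum()

# Spark: cast the null flags to int and sum them per column
df_sp.select([
    F.sum(F.col(c).isNull().cast('int')).alias(c) for c in df_sp.columns
]).show()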
Fill Missing Values:
Pandas:
df.fillna(value) or df.ffill(): Fills missing values with a specified value or by forward-filling (fillna(method='ffill') is deprecated in recent pandas versions).
Spark:
df.fillna(value, subset=["col1", "col2"]).show()
Drop Rows with Missing Values:
Pandas:
df.dropna(): Removes rows with any missing values.
Spark:
df.dropna()
Drop Duplicates:
Pandas:
df.drop_duplicates(subset=['col1', 'col2']): Removes duplicate rows based on specified columns.
Spark:
df.dropDuplicates(subset=['col1', 'col2'])
Apply a Function Element-Wise:
Pandas:
df.apply(func, axis=0): Applies a function to each column (axis=0) or to each row (axis=1); for true element-wise application use df.applymap(func).
Spark:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def my_function(x):
    # Your custom logic here
    return x * 2

my_udf = udf(my_function, IntegerType())
df.withColumn('new_col', my_udf('col')).show()
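Where Arrow is available (Spark 3.0+ with pyarrow installed), a vectorised pandas_udf is usually faster than a row-at-a-time UDF; a minimal sketch of the same doubling logic:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

@pandas_udf(IntegerType())
def double_values(s: pd.Series) -> pd.Series:
    # Operates on a whole Pandas Series per batch instead of one value at a time
    return s * 2

df.withColumn('new_col', double_values(df['col'])).show()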
Iterate Over Rows:
Pandas:
df.iterrows(): Iterates over rows as (index, Series) pairs.
Spark:
Spark DataFrames have no built-in row-wise iteration; if needed, df.collect() or df.toLocalIterator() brings rows to the driver, but prefer built-in Spark operations over explicit loops.
Select Rows Based on Condition:
Pandas:
df.loc[condition]: Selects rows based on a condition.
Spark:
df.filter(condition).show()
Select Rows and Columns by Integer Location:
Pandas:
df.iloc[row_index, col_index]: Selects rows and columns by integer location.
Spark:
Spark has no positional row indexing because rows are unordered; a column can still be selected by position, e.g. df.select(df.columns[col_index]), but specific rows must be selected with a filter condition (or, for small results, by collecting: df.collect()[row_index]).
Perform an Inner Join with Another DataFrame:
Pandas:
pd.merge(df1, df2, on='col')
Spark:
df1.join(df2, on='col', how='inner').show()
Joining DataFrames:
Pandas:
df.join(other_df, on='col', how='inner'): Joins the 'col' column of df against the index of other_df (pandas join matches on the other frame's index and defaults to a left join, so pass how explicitly).
Spark:
df1.join(other_df, on='col', how='inner').show()
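A sketch distinguishing the two Pandas entry points, since join defaults differ from merge (df1_pd/df2_pd and df1_sp/df2_sp are hypothetical DataFrames sharing a 'col' column):
import pandas as pd

# Pandas merge: column-to-column join, inner by default
merged_pd = pd.merge(df1_pd, df2_pd, on='col', how='inner')

# Pandas join: matches the caller's 'col' column against the other frame's index,
# and defaults to a left join, so set the index and how explicitly
joined_pd = df1_pd.join(df2_pd.set_index('col'), on='col', how='inner')

# Spark: one join method covers both cases; how defaults to 'inner'
joined_sp = df1_sp.join(df2_sp, on='col', how='inner')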
Creating a Pivot Table:
Pandas:
df.pivot(index='col1', columns='col2', values='col3'): Creates a pivot table.
Spark:
df.groupBy('col1').pivot('col2').agg({'col3': 'first'}).show()
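A sketch on a hypothetical table with placeholder columns col1, col2 and col3 (df_pd is Pandas, df_sp is Spark):
# Pandas pivot: pure reshape; each (col1, col2) pair must be unique
pivot_pd = df_pd.pivot(index='col1', columns='col2', values='col3')

# Pandas pivot_table: use when duplicates need aggregating
pivot_pd_agg = df_pd.pivot_table(index='col1', columns='col2', values='col3', aggfunc='first')

# Spark: groupBy + pivot always aggregates; 'first' mimics a plain reshape
df_sp.groupBy('col1').pivot('col2').agg({'col3': 'first'}).show()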