Created March 11, 2024 14:09
Pandas Dataframe --> Apache Spark Dataframe
Syntax comparison between Pandas and Apache Spark Dataframes
Creating DataFrames:
Pandas:
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Spark:
spark.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
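For example, the same small table can be built both ways. The Spark lines below assume an existing SparkSession named `spark` and are shown as comments so the snippet runs with pandas alone:

```python
import pandas as pd

data = [("alice", 3), ("bob", 7)]

# Pandas: column names are passed via `columns`
pdf = pd.DataFrame(data, columns=["name", "value"])

# Spark (assumes a SparkSession named `spark` exists):
# sdf = spark.createDataFrame(data, schema=["name", "value"])
# A pandas DataFrame can also be converted directly:
# sdf = spark.createDataFrame(pdf)

print(pdf.shape)  # (2, 2)
```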
Aggregation and Grouping:
Pandas:
df.groupby('col').agg({'col2': 'mean', 'col3': 'sum'})
df.groupby('col').col2.mean()
Spark:
df.groupBy('col').agg({'col2': 'avg', 'col3': 'sum'})
df.groupBy('col').avg('col2')
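A runnable pandas example of the dict-style aggregation; the Spark equivalent (which needs a SparkSession) is shown in comments. Note Spark spells the mean aggregate "avg":

```python
import pandas as pd

df = pd.DataFrame({"col": ["a", "a", "b"],
                   "col2": [1.0, 3.0, 5.0],
                   "col3": [10, 20, 30]})

# Pandas: one aggregation per column via a dict
agg = df.groupby("col").agg({"col2": "mean", "col3": "sum"})

# Spark equivalent (note "avg" instead of "mean"):
# sdf.groupBy("col").agg({"col2": "avg", "col3": "sum"})

print(agg.loc["a", "col2"])  # 2.0
```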
Filtering and Selection:
Pandas:
df[df['col'] > 10]
df.loc[df['col'] > 10, 'col2']
Spark:
df.filter(df['col'] > 10)
df.select('col2').where(df['col'] > 10)
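The pandas filter-then-select pattern in runnable form, with the Spark equivalent as a comment:

```python
import pandas as pd

df = pd.DataFrame({"col": [5, 15, 25], "col2": ["x", "y", "z"]})

# Pandas: boolean mask for rows, label for the column
filtered = df.loc[df["col"] > 10, "col2"]

# Spark equivalent:
# sdf.select("col2").where(sdf["col"] > 10)

print(list(filtered))  # ['y', 'z']
```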
Sorting:
Pandas:
df.sort_values(by='col')
Spark:
df.orderBy('col')
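A minimal sorting example; both APIs default to ascending order, and the Spark descending form is shown as a comment:

```python
import pandas as pd

df = pd.DataFrame({"col": [3, 1, 2]})

# Pandas: ascending sort by column
sorted_df = df.sort_values(by="col")

# Spark equivalents:
# sdf.orderBy("col")              # ascending
# sdf.orderBy(sdf["col"].desc())  # descending

print(list(sorted_df["col"]))  # [1, 2, 3]
```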
Joining and Merging:
Pandas:
pd.merge(df1, df2, on='col'): Performs an inner join with another DataFrame based on a common column.
pd.concat([df1, df2], axis=0): Concatenates DataFrames vertically (along rows). This also replaces df1.append(df2), which was removed in pandas 2.0.
Spark:
df1.join(other_df, on='col', how='inner'): Joins two DataFrames based on a common column (inner join by default).
df1.union(other_df): Concatenates DataFrames vertically, matching columns by position.
df1.unionByName(other_df): Union of DataFrames matched by column names (the column names must match).
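A runnable sketch of the pandas side (inner merge and vertical concatenation), with the Spark equivalents as comments:

```python
import pandas as pd

df1 = pd.DataFrame({"col": [1, 2], "a": ["x", "y"]})
df2 = pd.DataFrame({"col": [2, 3], "b": ["p", "q"]})

# Pandas inner join on a common column: only col == 2 matches
joined = pd.merge(df1, df2, on="col")

# Vertical concatenation (the replacement for the removed df.append)
stacked = pd.concat([df1, df1], axis=0, ignore_index=True)

# Spark equivalents:
# df1.join(df2, on="col", how="inner")
# df1.union(other_df)        # matches columns by position
# df1.unionByName(other_df)  # matches columns by name

print(len(joined), len(stacked))  # 1 4
```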
Shape (Dimensions):
Pandas:
df.shape: Returns a tuple representing the dimensions (rows, columns) of the DataFrame.
Spark:
To get the number of rows: df.count()
To get the number of columns: len(df.columns)
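In runnable form: one attribute in pandas versus two calls in Spark (shown as comments, since `count()` triggers a distributed job):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Pandas: rows and columns in one tuple
n_rows, n_cols = df.shape

# Spark: two separate calls (count() runs a job over the data)
# n_rows = sdf.count()
# n_cols = len(sdf.columns)

print(n_rows, n_cols)  # 3 2
```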
Column Names:
Pandas:
df.columns: Returns a list of column names.
Spark:
df.columns
Index (Row Labels):
Pandas:
df.index: Returns the index (row labels) of the DataFrame.
Spark:
Spark DataFrames do not have an explicit index like Pandas. Rows are unordered unless an explicit sort is applied, so there is no stable row position to rely on.
Head and Tail:
Pandas:
df.head(n=5): Returns the first n rows of the DataFrame.
df.tail(n=5): Returns the last n rows of the DataFrame.
Spark:
To get the first n rows: df.limit(n).show()
To get the last n rows: df.tail(n) (Spark 3.0+, returns a list of Row objects). "First" and "last" are only meaningful after an explicit orderBy, since Spark rows are unordered.
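The pandas head/tail pattern in runnable form; the Spark equivalents are comments:

```python
import pandas as pd

df = pd.DataFrame({"col": range(10)})

first = df.head(3)
last = df.tail(3)

# Spark equivalents:
# sdf.limit(3).show()  # first 3 rows
# sdf.tail(3)          # last 3 rows as a list of Row objects (Spark 3.0+)

print(list(first["col"]), list(last["col"]))  # [0, 1, 2] [7, 8, 9]
```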
Data Types:
Pandas:
df.dtypes: Returns the data types of each column.
Spark:
df.dtypes
Summary Statistics:
Pandas:
df.describe(): Generates summary statistics (count, mean, std, min, max, etc.) for numeric columns.
Spark:
df.describe().show(): count, mean, stddev, min, max.
df.summary().show(): the same plus 25%/50%/75% percentiles.
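A runnable pandas example; note that pandas returns the statistics as a DataFrame you can index, while Spark's describe()/summary() return a DataFrame of strings you would usually just show():

```python
import pandas as pd

df = pd.DataFrame({"col": [1.0, 2.0, 3.0, 4.0]})

stats = df.describe()

# Spark equivalents:
# sdf.describe().show()  # count, mean, stddev, min, max
# sdf.summary().show()   # adds 25%/50%/75% percentiles

print(stats.loc["mean", "col"])  # 2.5
```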
Sample:
Pandas:
df.sample(n=5): Returns a random sample of n rows from the DataFrame.
Spark:
df.sample(False, fraction=0.1).show(): Samples by fraction rather than an exact row count (the first argument disables sampling with replacement).
Drop Columns:
Pandas:
df.drop(columns=['col1', 'col2']): Removes specified columns from the DataFrame.
Spark:
df.drop("col1", "col2")
Rename Columns:
Pandas:
df.rename(columns={'old_col': 'new_col'}): Renames columns.
Spark:
df.withColumnRenamed("old_col", "new_col")
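A minimal rename example; note that Spark's withColumnRenamed renames one column per call, while the pandas dict can rename several at once:

```python
import pandas as pd

df = pd.DataFrame({"old_col": [1, 2]})

renamed = df.rename(columns={"old_col": "new_col"})

# Spark equivalent (one column per call):
# sdf.withColumnRenamed("old_col", "new_col")

print(list(renamed.columns))  # ['new_col']
```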
Set Index:
Pandas:
df.set_index('col'): Sets a column as the index.
Spark:
Spark DataFrames do not have an explicit index. Use other methods for row identification.
Reset Index:
Pandas:
df.reset_index(): Resets the index to the default integer index.
Spark:
Spark DataFrames do not have an explicit index. No direct equivalent.
Sort Values:
Pandas:
df.sort_values(by='col'): Sorts the DataFrame by a specified column.
Spark:
df.orderBy("col").show()
Missing Values:
Pandas:
df.isnull(): Returns a DataFrame of Boolean values indicating missing (NaN) values.
df.notnull(): Returns a DataFrame of Boolean values indicating non-missing values.
Spark:
from pyspark.sql.functions import col
df.select([col(c).isNull().alias(c) for c in df.columns]).show()
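The pandas null check in runnable form, with the Spark per-column equivalent as comments:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"col": [1.0, np.nan, 3.0]})

mask = df["col"].isnull()

# Spark equivalent (a null flag per column):
# from pyspark.sql.functions import col
# sdf.select([col(c).isNull().alias(c) for c in sdf.columns]).show()

print(list(mask))  # [False, True, False]
```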
Fill Missing Values:
Pandas:
df.fillna(value): Fills missing values with a specified value. For forward fill use df.ffill() (the method='ffill' argument is deprecated in pandas 2.x).
Spark:
df.fillna(value, subset=["col1", "col2"]).show(): Fills with a constant; Spark has no built-in forward fill.
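Constant fill and forward fill in runnable pandas, with the Spark constant-fill equivalent as a comment:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"col": [1.0, np.nan, 3.0]})

filled = df.fillna(0.0)  # constant fill
ffilled = df.ffill()     # forward fill (replaces fillna(method='ffill'))

# Spark: constant fill only, optionally restricted to columns
# sdf.fillna(0.0, subset=["col"])

print(list(filled["col"]), list(ffilled["col"]))  # [1.0, 0.0, 3.0] [1.0, 1.0, 3.0]
```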
Drop Rows with Missing Values:
Pandas:
df.dropna(): Removes rows with any missing values.
Spark:
df.dropna()
Drop Duplicates:
Pandas:
df.drop_duplicates(subset=['col1', 'col2']): Removes duplicate rows based on specified columns.
Spark:
df.dropDuplicates(subset=['col1', 'col2'])
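A runnable pandas example; pandas keeps the first occurrence by default, while Spark's dropDuplicates keeps an arbitrary one of each duplicate group:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2],
                   "col2": ["a", "a", "b"],
                   "col3": [9, 8, 7]})

# Rows 0 and 1 duplicate on (col1, col2); pandas keeps the first
deduped = df.drop_duplicates(subset=["col1", "col2"])

# Spark equivalent (keeps an arbitrary row per duplicate group):
# sdf.dropDuplicates(subset=["col1", "col2"])

print(len(deduped))  # 2
```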
Apply a Function Element-Wise:
Pandas:
df.apply(func, axis=0): Applies a function to each column (axis=0) or each row (axis=1). For true element-wise application use df.map (df.applymap in older pandas).
Spark:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def my_function(x):
    # Your custom logic here
    return x * 2

my_udf = udf(my_function, IntegerType())
df.withColumn('new_col', my_udf('col')).show()
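The pandas counterpart of the UDF above, runnable as-is; for Spark, a built-in column expression (comment below) is usually faster than a Python UDF:

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# Pandas: element-wise via Series.apply
df["new_col"] = df["col"].apply(lambda x: x * 2)

# Spark: the UDF above does the same, but prefer built-in
# column expressions when the logic allows it:
# sdf.withColumn("new_col", sdf["col"] * 2)

print(list(df["new_col"]))  # [2, 4, 6]
```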
Iterate Over Rows:
Pandas:
df.iterrows(): Iterates over rows as (index, Series) pairs.
Spark:
Spark has no direct equivalent; df.collect() or df.toLocalIterator() bring rows to the driver if iteration is unavoidable, but prefer built-in column operations over explicit iteration.
Select Rows Based on Condition:
Pandas:
df.loc[condition]: Selects rows based on a condition.
Spark:
df.filter(condition).show()
Select Rows and Columns by Integer Location:
Pandas:
df.iloc[row_index, col_index]: Selects rows and columns by integer location.
Spark:
Spark has no positional indexing, since rows are unordered. Columns can be selected by position via df.select(df.columns[col_index]); when a row position is needed, use a key column (or monotonically_increasing_id together with an explicit sort).
Perform an Inner Join with Another DataFrame:
Pandas:
pd.merge(df1, df2, on='col')
df1.join(df2.set_index('col'), on='col', how='inner'): df.join matches against the other DataFrame's index, so the join column must be its index.
Spark:
df1.join(df2, on='col', how='inner').show()
Creating a Pivot Table:
Pandas:
df.pivot(index='col1', columns='col2', values='col3'): Reshapes without aggregating (each index/column pair must be unique); use df.pivot_table when duplicates need to be aggregated.
Spark:
df.groupBy('col1').pivot('col2').agg({'col3': 'first'}).show()
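A runnable pandas pivot; the Spark equivalent (which always requires an aggregate) is shown as a comment:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["r1", "r1", "r2"],
                   "col2": ["A", "B", "A"],
                   "col3": [1, 2, 3]})

# Pandas: pure reshape; each (col1, col2) pair must be unique,
# otherwise use df.pivot_table with an aggregation function
pivoted = df.pivot(index="col1", columns="col2", values="col3")

# Spark equivalent (an aggregate is always required):
# sdf.groupBy("col1").pivot("col2").agg({"col3": "first"})

print(pivoted.loc["r1", "A"])  # 1
```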