Pandas DataFrame --> Apache Spark DataFrame
Syntax comparison between Pandas and Apache Spark DataFrames
Creating DataFrames:
Pandas:
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Spark:
spark.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
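As a minimal sketch (the column names and values are made up for illustration, and spark is assumed to be an existing SparkSession), the same two-column table can be built both ways:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pandas: build from a dict of columns
pdf = pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})

# Spark: build from a list of tuples plus a list of column names as the schema
sdf = spark.createDataFrame([('a', 1), ('b', 2)], schema=['name', 'value'])

# An existing Pandas DataFrame can also be handed to createDataFrame directly
sdf_from_pandas = spark.createDataFrame(pdf)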
Aggregation and Grouping:
Pandas:
df.groupby('col').agg({'col2': 'mean', 'col3': 'sum'})
df.groupby('col').col2.mean()
Spark:
df.groupBy('col').agg({'col2': 'avg', 'col3': 'sum'})
df.groupBy('col').avg('col2')
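A sketch of the same aggregation written with Spark's column-expression API, which also lets you alias the output columns (df_sp is a hypothetical Spark DataFrame with columns col, col2 and col3):
from pyspark.sql import functions as F

# Equivalent to the dict form above, but with explicit output column names
df_sp.groupBy('col').agg(
    F.avg('col2').alias('col2_mean'),
    F.sum('col3').alias('col3_sum')
).show()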
Filtering and Selection:
Pandas:
df[df['col'] > 10]
df.loc[df['col'] > 10, 'col2']
Spark:
df.filter(df['col'] > 10)
df.select('col2').where(df['col'] > 10)
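Compound conditions look similar in both libraries; each term needs its own parentheses. A sketch with hypothetical DataFrames df_pd (Pandas) and df_sp (Spark):
# Pandas: combine boolean masks with & / |
subset_pd = df_pd.loc[(df_pd['col'] > 10) & (df_pd['col2'] == 'x'), ['col2', 'col3']]

# Spark: same operators on Column expressions
subset_sp = df_sp.filter((df_sp['col'] > 10) & (df_sp['col2'] == 'x')).select('col2', 'col3')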
Sorting:
Pandas:
df.sort_values(by='col')
Spark:
df.orderBy('col')
Joining and Merging:
Pandas:
pd.merge(df1, df2, on='col'): Performs an inner join with another DataFrame based on a common column.
pd.concat([df1, df2], axis=0): Concatenates DataFrames vertically (along rows).
df1.append(df2): Appends rows from df2 to df1 (deprecated since pandas 1.4 and removed in 2.0; use pd.concat instead).
Spark:
df1.join(other_df, on='col', how='inner'): Joins two DataFrames based on a common column (inner join by default).
df1.union(other_df): Concatenates DataFrames vertically (union of rows).
df1.unionByName(other_df): Union of DataFrames by column names (columns must match).
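A short sketch of vertical concatenation in both APIs, assuming df1_pd/df2_pd and df1_sp/df2_sp are hypothetical DataFrames with identical columns:
import pandas as pd

# Pandas: stack rows; ignore_index rebuilds the default integer index
combined_pd = pd.concat([df1_pd, df2_pd], axis=0, ignore_index=True)

# Spark: union matches columns by position, unionByName matches by name
combined_sp = df1_sp.union(df2_sp)
combined_by_name = df1_sp.unionByName(df2_sp)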
Shape (Dimensions):
Pandas:
df.shape: Returns a tuple representing the dimensions (rows, columns) of the DataFrame.
Spark:
To get the number of rows: df.count()
To get the number of columns: len(df.columns)
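There is no single shape attribute in Spark, but the pair can be assembled; note that count() runs a job over the data while columns is just metadata. A sketch with a hypothetical df_sp:
n_rows = df_sp.count()        # triggers a full computation
n_cols = len(df_sp.columns)   # metadata only, no computation
shape = (n_rows, n_cols)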
Column Names:
Pandas:
df.columns: Returns a list of column names.
Spark:
df.columns
Index (Row Labels):
Pandas:
df.index: Returns the index (row labels) of the DataFrame.
Spark:
Spark DataFrames do not have an explicit index like Pandas, and rows have no inherent order. If a row identifier is needed, one can be added with e.g. pyspark.sql.functions.monotonically_increasing_id().
Head and Tail:
Pandas:
df.head(n=5): Returns the first n rows of the DataFrame.
df.tail(n=5): Returns the last n rows of the DataFrame.
Spark:
To get the first n rows: df.limit(n).show() (or simply df.show(n))
To get the last n rows: there is no "last" without an ordering; sort descending and take the first n, e.g. df.orderBy(df['some_column'].desc()).limit(n).show(), or use df.tail(n) (Spark 3.0+), which returns a list of Row objects.
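A sketch of both directions, with an illustrative sort column name and a hypothetical df_sp:
from pyspark.sql import functions as F

# First 5 rows (order is arbitrary unless the DataFrame is sorted)
df_sp.limit(5).show()

# "Last" 5 rows relative to some_column: sort descending, then take the first 5
df_sp.orderBy(F.col('some_column').desc()).limit(5).show()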
Data Types:
Pandas:
df.dtypes: Returns the data types of each column.
Spark:
df.dtypes
Summary Statistics:
Pandas:
df.describe(): Generates summary statistics (count, mean, std, min, max, etc.) for numeric columns.
Spark:
df.summary().show()
Sample:
Pandas:
df.sample(n=5): Returns a random sample of n rows from the DataFrame.
Spark:
df.sample(False, fraction=0.1).show()
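Note the difference in semantics: Pandas sample takes an exact row count, while Spark sample takes a fraction, so the Spark result size is only approximate. A sketch with hypothetical df_pd and df_sp:
# Pandas: exactly 5 random rows; random_state makes it repeatable
sample_pd = df_pd.sample(n=5, random_state=42)

# Spark: roughly 10% of rows, without replacement; seed makes it repeatable
sample_sp = df_sp.sample(withReplacement=False, fraction=0.1, seed=42)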
Drop Columns:
Pandas:
df.drop(columns=['col1', 'col2']): Removes specified columns from the DataFrame.
Spark:
df.drop("col1", "col2")
Rename Columns:
Pandas:
df.rename(columns={'old_col': 'new_col'}): Renames columns.
Spark:
df.withColumnRenamed("old_col", "new_col")
Set Index:
Pandas:
df.set_index('col'): Sets a column as the index.
Spark:
Spark DataFrames do not have an explicit index. Use other methods for row identification.
Reset Index:
Pandas:
df.reset_index(): Resets the index to default integer index.
Spark:
Spark DataFrames do not have an explicit index. No direct equivalent.
Sort Values:
Pandas:
df.sort_values(by='col'): Sorts the DataFrame by a specified column.
Spark:
df.orderBy("col").show()
Missing Values:
Pandas:
df.isnull(): Returns a DataFrame of Boolean values indicating missing (NaN) values.
df.notnull(): Returns a DataFrame of Boolean values indicating non-missing values.
Spark:
from pyspark.sql.functions import col
df.select([col(c).isNull().alias(c) for c in df.columns]).show()
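A common variant is counting missing values per column rather than flagging them row by row; a sketch with hypothetical df_pd and df_sp:
from pyspark.sql import functions as F

# Pandas: missing-value count per column
null_counts_pd = df_pd.isnull().sum()

# Spark: cast the null flags to int and sum them per column
df_sp.select([
    F.sum(F.col(c).isNull().cast('int')).alias(c) for c in df_sp.columns
]).show()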
Fill Missing Values:
Pandas:
df.fillna(value) or df.ffill(): Fills missing values with a specified value or by forward-filling (fillna(method='ffill') is deprecated in recent pandas versions).
Spark:
df.fillna(value, subset=["col1", "col2"]).show()
Drop Rows with Missing Values:
Pandas:
df.dropna(): Removes rows with any missing values.
Spark:
df.dropna()
Drop Duplicates:
Pandas:
df.drop_duplicates(subset=['col1', 'col2']): Removes duplicate rows based on specified columns.
Spark:
df.dropDuplicates(subset=['col1', 'col2'])
Apply a Function Element-Wise:
Pandas:
df.apply(func, axis=0): Applies a function to each column (axis=0) or to each row (axis=1); for true element-wise application use df.applymap(func).
Spark:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def my_function(x):
    # Your custom logic here
    return x * 2

my_udf = udf(my_function, IntegerType())
df.withColumn('new_col', my_udf('col')).show()
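Where Arrow is available (Spark 3.0+ with pyarrow installed), a vectorised pandas_udf is usually faster than a row-at-a-time UDF; a minimal sketch of the same doubling logic:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

@pandas_udf(IntegerType())
def double_values(s: pd.Series) -> pd.Series:
    # Operates on a whole Pandas Series per batch instead of one value at a time
    return s * 2

df.withColumn('new_col', double_values(df['col'])).show()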
Iterate Over Rows:
Pandas:
df.iterrows(): Iterates over rows as (index, Series) pairs.
Spark:
Spark DataFrames have no built-in row-wise iteration; if needed, df.collect() or df.toLocalIterator() brings rows to the driver, but prefer built-in Spark operations over explicit loops.
Select Rows Based on Condition:
Pandas:
df.loc[condition]: Selects rows based on a condition.
Spark:
df.filter(condition).show()
Select Rows and Columns by Integer Location:
Pandas:
df.iloc[row_index, col_index]: Selects rows and columns by integer location.
Spark:
Spark has no positional row indexing because rows are unordered; a column can still be selected by position, e.g. df.select(df.columns[col_index]), but specific rows must be selected with a filter condition (or, for small results, by collecting: df.collect()[row_index]).
Perform an Inner Join with Another DataFrame:
Pandas:
pd.merge(df1, df2, on='col')
Spark:
df1.join(df2, on='col', how='inner').show()
Joining DataFrames:
Pandas:
df.join(other_df, on='col', how='inner'): Joins the 'col' column of df against the index of other_df (pandas join matches on the other frame's index and defaults to a left join, so pass how explicitly).
Spark:
df1.join(other_df, on='col', how='inner').show()
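A sketch distinguishing the two Pandas entry points, since join defaults differ from merge (df1_pd/df2_pd and df1_sp/df2_sp are hypothetical DataFrames sharing a 'col' column):
import pandas as pd

# Pandas merge: column-to-column join, inner by default
merged_pd = pd.merge(df1_pd, df2_pd, on='col', how='inner')

# Pandas join: matches the caller's 'col' column against the other frame's index,
# and defaults to a left join, so set the index and how explicitly
joined_pd = df1_pd.join(df2_pd.set_index('col'), on='col', how='inner')

# Spark: one join method covers both cases; how defaults to 'inner'
joined_sp = df1_sp.join(df2_sp, on='col', how='inner')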
Creating a Pivot Table:
Pandas:
df.pivot(index='col1', columns='col2', values='col3'): Creates a pivot table.
Spark:
df.groupBy('col1').pivot('col2').agg({'col3': 'first'}).show()
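A sketch on a hypothetical table with placeholder columns col1, col2 and col3 (df_pd is Pandas, df_sp is Spark):
# Pandas pivot: pure reshape; each (col1, col2) pair must be unique
pivot_pd = df_pd.pivot(index='col1', columns='col2', values='col3')

# Pandas pivot_table: use when duplicates need aggregating
pivot_pd_agg = df_pd.pivot_table(index='col1', columns='col2', values='col3', aggfunc='first')

# Spark: groupBy + pivot always aggregates; 'first' mimics a plain reshape
df_sp.groupBy('col1').pivot('col2').agg({'col3': 'first'}).show()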