Skip to content

Instantly share code, notes, and snippets.

@bob-ortiz
Created July 12, 2023 05:54
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save bob-ortiz/0166303598b89dc07827ba855e102ff9 to your computer and use it in GitHub Desktop.
Save bob-ortiz/0166303598b89dc07827ba855e102ff9 to your computer and use it in GitHub Desktop.
Difference between a merge and a join in pandas

Note

The following answer was provided by Bart when I asked the difference between merge and join. I thought the answer was good enough and wanted to share it and to also remember this important difference.

Output

The join() and merge() methods in pandas are used to combine two DataFrames. They are similar in that they both combine DataFrames based on common columns or indices. However, there are some key differences between the two methods:

  • join() only joins on the index. If you want to join on columns, you need to set the on parameter.
  • merge() can join on columns or indices. You can specify the columns to join on with the on parameter, or you can specify the indices to join on with the left_index and right_index parameters.
  • join() by default performs a left join. This means that all rows from the left DataFrame are included in the result, even if there are no matching rows in the right DataFrame.
  • merge() by default performs an inner join. This means that only rows that have matching values in both DataFrames are included in the result.

In addition to these differences, there are some other things to keep in mind when using join() and merge():

  • join() is a DataFrame method, while merge() is a function in the pandas namespace.
  • join() can be used to join DataFrames on a DataFrame's index, but merge() cannot.
  • merge() can be used to join DataFrames on a combination of columns and indices, while join() can only join on a single index.

Here is an example of how to use join() and merge():

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'], 'Age': [20, 30, 40]})
df2 = pd.DataFrame({'Name': ['Alice', 'Carol', 'Dave'], 'Country': ['USA', 'UK', 'Canada']})

# Join the DataFrames on the 'Name' column
df_joined = df1.join(df2, on='Name')

# Merge the DataFrames on the 'Name' column
df_merged = pd.merge(df1, df2, on='Name')

# Print the results
print(df_joined)
print(df_merged)

The output of the join() method will be a DataFrame with all rows from df1, and the matching rows from df2. The output of the merge() method will be a DataFrame with only the rows that have matching values in both df1 and df2.

I hope this helps! Let me know if you have any other questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment