@pydemo
Created May 8, 2024 13:01

What are the different join types available in Apache Spark, with code examples?

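| Join Type | Description | Code Example |
| --- | --- | --- |
| Inner Join | Returns only the rows whose keys match in both datasets. This is the default when no join type is given. | `df1.join(df2, df1("key") === df2("key"))` |
| Outer Join | In Spark, `"outer"` is an alias for a full outer join: it returns all rows from both datasets, with nulls where the other side has no match. Left and right outer joins are separate types, listed below. | `df1.join(df2, df1("key") === df2("key"), "outer")` |
| Left Outer Join | Returns all rows from the left dataset and the matched rows from the right dataset. Right-side columns are null where there is no match. | `df1.join(df2, df1("key") === df2("key"), "left_outer")` |
| Right Outer Join | Returns all rows from the right dataset and the matched rows from the left dataset. Left-side columns are null where there is no match. | `df1.join(df2, df1("key") === df2("key"), "right_outer")` |
| Full Outer Join | Combines the results of both left and right outer joins. Same result as `"outer"`. | `df1.join(df2, df1("key") === df2("key"), "full_outer")` |
| Left Semi Join | Returns only the rows from the left dataset that have a match in the right dataset. Only left-side columns appear in the result. | `df1.join(df2, df1("key") === df2("key"), "left_semi")` |
| Left Anti Join | Returns only the rows from the left dataset that have no corresponding row in the right dataset. | `df1.join(df2, df1("key") === df2("key"), "left_anti")` |
| Cross Join | Produces the Cartesian product of the two datasets. | `df1.crossJoin(df2)` |
| Broadcast Join | A physical join strategy rather than a join type. Used when one of the datasets is small enough to be broadcast to every executor, which avoids shuffling the larger side. `broadcast` comes from `org.apache.spark.sql.functions`. | `df1.join(broadcast(df2), df1("key") === df2("key"))` |
| Shuffle Hash Join | A physical join strategy. Both datasets are shuffled by the join key, then a hash table is built on the smaller side of each partition. It can avoid the sorting cost when one side is much smaller than the other but too large to broadcast. Setting `preferSortMergeJoin` to `false` lets the planner consider this strategy but does not force it. | `spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")` then `df1.join(df2, df1("key") === df2("key"))` |
| Sort Merge Join | A physical join strategy. Both datasets are shuffled and sorted on the join key and then merged. Efficient for large datasets, and preferred by default for equi-joins that are not broadcast (`preferSortMergeJoin` defaults to `true`). | `spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")` then `df1.join(df2, df1("key") === df2("key"))` |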
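
To see how the join types in the table differ on actual rows, here is a minimal sketch that runs each one against two tiny DataFrames. The sample data, column names (`left_val`, `right_val`), object name, and the local-mode session are assumptions made for illustration and are not part of the original note. The strategy hints at the end (`shuffle_hash`, `merge`) assume Spark 3.0 or later; they are requests to the planner, so `explain()` is used to check which strategy was actually chosen.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinTypesSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption: not a cluster setup).
    val spark = SparkSession.builder()
      .appName("join-types-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: key 1 exists only on the left,
    // key 4 only on the right, keys 2 and 3 on both sides.
    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "left_val")
    val df2 = Seq((2, "x"), (3, "y"), (4, "z")).toDF("key", "right_val")
    val cond = df1("key") === df2("key")

    // Logical join types: these change which rows come back.
    df1.join(df2, cond, "inner").show()        // keys 2 and 3
    df1.join(df2, cond, "left_outer").show()   // keys 1, 2, 3; right side null for key 1
    df1.join(df2, cond, "right_outer").show()  // keys 2, 3, 4; left side null for key 4
    df1.join(df2, cond, "full_outer").show()   // keys 1, 2, 3, 4 ("outer" is the same)
    df1.join(df2, cond, "left_semi").show()    // left rows with keys 2 and 3, left columns only
    df1.join(df2, cond, "left_anti").show()    // left row with key 1 only
    df1.crossJoin(df2).show()                  // 3 x 3 = 9 rows

    // Physical strategies: these change how the join runs, not its result.
    // Each line asks the planner for a strategy and prints the resulting plan.
    df1.join(broadcast(df2), cond).explain()            // request a broadcast hash join
    df1.join(df2.hint("shuffle_hash"), cond).explain()  // request a shuffle hash join (Spark 3.0+)
    df1.join(df2.hint("merge"), cond).explain()         // request a sort merge join (Spark 3.0+)

    spark.stop()
  }
}
```

On Spark 3.x, per-join hints like these are generally a more targeted way to influence the strategy than setting `spark.sql.join.preferSortMergeJoin` for the whole session, since the hint applies only to the join it is attached to.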