What are the different join types available in Apache Spark, with code examples?
Join Type | Description | Code Example |
---|---|---|
Inner Join | Returns rows when there is a match in both datasets. | df1.join(df2, df1("key") === df2("key")) |
Outer Join | Alias for a full outer join: "outer", "full", and "full_outer" are equivalent join type strings. Returns all rows from both datasets, with nulls filling the columns of whichever side has no match. | df1.join(df2, df1("key") === df2("key"), "outer") |
Left Outer Join | Returns all rows from the left dataset, and the matched rows from the right dataset. | df1.join(df2, df1("key") === df2("key"), "left_outer") |
Right Outer Join | Returns all rows from the right dataset, and the matched rows from the left dataset. | df1.join(df2, df1("key") === df2("key"), "right_outer") |
Full Outer Join | Combines the results of both left and right outer joins. | df1.join(df2, df1("key") === df2("key"), "full_outer") |
Left Semi Join | Returns only the rows from the left dataset where a match is found in the right dataset. | df1.join(df2, df1("key") === df2("key"), "left_semi") |
Left Anti Join | Returns only the rows from the left dataset for which there is no corresponding row in the right dataset. | df1.join(df2, df1("key") === df2("key"), "left_anti") |
Cross Join | Produces the Cartesian product of two datasets. | df1.crossJoin(df2) |
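Broadcast Join | An execution strategy rather than a logical join type. Used when one of the datasets is small enough to be broadcast to every executor, which avoids shuffling the larger dataset. | import org.apache.spark.sql.functions.broadcast; df1.join(broadcast(df2), df1("key") === df2("key")) |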
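Shuffle Hash Join | An execution strategy: both datasets are shuffled by the join key, then a hash table is built from the smaller side within each partition. Can beat a sort merge join when one side is much smaller than the other but still too large to broadcast, because no sort is needed. Disabling preferSortMergeJoin only makes this strategy eligible; in Spark 3.0+ it can be requested explicitly with df2.hint("shuffle_hash"). | spark.conf.set("spark.sql.join.preferSortMergeJoin", "false"); df1.join(df2, df1("key") === df2("key")) |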
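Sort Merge Join | An execution strategy: both datasets are shuffled and sorted on the join key, then merged. Spark's usual choice for equi-joins between large datasets, since sorted data can spill to disk. The preferSortMergeJoin setting is already "true" by default. | spark.conf.set("spark.sql.join.preferSortMergeJoin", "true"); df1.join(df2, df1("key") === df2("key")) |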
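
The rows above can be tried end to end with a small sketch like the one below. It is an illustration under stated assumptions, not a definitive reference: the object name `JoinTypesSketch`, the helper `joinCounts`, the column names, and the sample rows are all made up for this example, and it assumes Spark 3.0 or later on the classpath (the `shuffle_hash` and `merge` hints are not available in 2.x). It counts the rows each logical join type returns for two three-row DataFrames, then prints the physical plans so the chosen strategy can be inspected.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical example object; names and data are illustrative only.
object JoinTypesSketch {

  // Returns the row count produced by each logical join type
  // for two tiny DataFrames that share keys 2 and 3.
  def joinCounts(spark: SparkSession): Map[String, Long] = {
    import spark.implicits._

    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "left_value")
    val df2 = Seq((2, "x"), (3, "y"), (4, "z")).toDF("key", "right_value")
    val cond = df1("key") === df2("key")

    val joinTypes = Seq(
      "inner", "left_outer", "right_outer",
      "full_outer", "left_semi", "left_anti")

    // Same join condition each time; only the join type string changes.
    val counts = joinTypes.map(t => t -> df1.join(df2, cond, t).count()).toMap

    // A cross join takes no condition: every left row pairs with every right row.
    counts + ("cross" -> df1.crossJoin(df2).count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-types-sketch")
      .master("local[*]") // assumption: local mode, for experimentation only
      .getOrCreate()
    import spark.implicits._

    joinCounts(spark).toSeq.sortBy(_._1).foreach {
      case (joinType, n) => println(s"$joinType: $n")
    }
    // inner: 2, left_outer: 3, right_outer: 3, full_outer: 4,
    // left_semi: 2, left_anti: 1, cross: 9

    val big = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "left_value")
    val small = Seq((2, "x"), (3, "y"), (4, "z")).toDF("key", "right_value")
    val cond = big("key") === small("key")

    // Physical strategies are requested with hints; explain() prints
    // the plan so you can check which join operator was selected.
    big.join(broadcast(small), cond).explain()
    big.join(small.hint("shuffle_hash"), cond).explain()
    big.join(small.hint("merge"), cond).explain()

    spark.stop()
  }
}
```

Hints are requests, not guarantees: the optimizer may still pick a different strategy if the hinted one does not apply to the join, so reading the `explain()` output is the reliable way to confirm what actually ran.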