pydemo/chat.md

## chat.md

      
What are the different join types available in Apache Spark, with code examples?


| Join Type | Description | Code Example |
| --- | --- | --- |
| Inner Join | Returns only the rows that have a match in both datasets. This is the default when no join type is given. | `df1.join(df2, df1("key") === df2("key"))` |
| Outer Join | An alias for the full outer join: returns all rows from both datasets, with nulls in the columns of whichever side has no match. | `df1.join(df2, df1("key") === df2("key"), "outer")` |
| Left Outer Join | Returns all rows from the left dataset and the matching rows from the right dataset; right-side columns are null where there is no match. | `df1.join(df2, df1("key") === df2("key"), "left_outer")` |
| Right Outer Join | Returns all rows from the right dataset and the matching rows from the left dataset; left-side columns are null where there is no match. | `df1.join(df2, df1("key") === df2("key"), "right_outer")` |
| Full Outer Join | Combines the results of the left and right outer joins. | `df1.join(df2, df1("key") === df2("key"), "full_outer")` |
| Left Semi Join | Returns only the rows of the left dataset that have a match in the right dataset. Only the left dataset's columns appear in the result. | `df1.join(df2, df1("key") === df2("key"), "left_semi")` |
| Left Anti Join | Returns only the rows of the left dataset that have no matching row in the right dataset. | `df1.join(df2, df1("key") === df2("key"), "left_anti")` |
| Cross Join | Produces the Cartesian product of the two datasets. | `df1.crossJoin(df2)` |
| Broadcast Join | A physical join strategy rather than a join type. Used when one dataset is small enough to be broadcast to every executor, which avoids shuffling the larger dataset. `broadcast` is imported from `org.apache.spark.sql.functions`. | `df1.join(broadcast(df2), df1("key") === df2("key"))` |
| Shuffle Hash Join | A physical join strategy. Both datasets are shuffled by the join key, and within each partition the smaller side is built into a hash table. Turning off the sort-merge preference lets the planner consider this strategy but does not force it. | `spark.conf.set("spark.sql.join.preferSortMergeJoin", "false"); df1.join(df2, df1("key") === df2("key"))` |
| Sort Merge Join | A physical join strategy. Both datasets are shuffled and sorted on the join key and then merged. This is Spark's default for large equi-joins, so the setting shown is already `true` unless it was changed. | `spark.conf.set("spark.sql.join.preferSortMergeJoin", "true"); df1.join(df2, df1("key") === df2("key"))` |
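
The join types in the table above can be tried out on two small DataFrames. The following is a minimal sketch, not a definitive implementation: the object name `JoinTypesDemo`, the helper names `sampleData` and `joinCounts`, and the sample rows are all hypothetical, and it assumes Spark 3.0 or later running in local mode (the `shuffle_hash` and `merge` hint names are join strategy hints from that release line). Hints are requests to the planner, so the sketch prints the physical plan with `explain()` instead of assuming which strategy was chosen.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object JoinTypesDemo {
  // Hypothetical sample data: keys 1 and 2 match, key 3 is left-only, key 4 is right-only.
  def sampleData(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "left_val")
    val df2 = Seq((1, "x"), (2, "y"), (4, "z")).toDF("key", "right_val")
    (df1, df2)
  }

  // Row count produced by each logical join type on the sample data.
  def joinCounts(spark: SparkSession): Map[String, Long] = {
    val (df1, df2) = sampleData(spark)
    val cond = df1("key") === df2("key")
    val types = Seq("inner", "left_outer", "right_outer",
                    "full_outer", "left_semi", "left_anti")
    val counts = types.map(t => t -> df1.join(df2, cond, t).count()).toMap
    counts + ("cross" -> df1.crossJoin(df2).count())
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation (assumption: no cluster is needed).
    val spark = SparkSession.builder()
      .appName("join-types-demo")
      .master("local[*]")
      .getOrCreate()

    // Logical join types: the join type decides which rows survive.
    joinCounts(spark).toSeq.sortBy(_._1).foreach { case (t, n) =>
      println(s"$t: $n rows")
    }

    // Physical strategies: request one, then inspect the plan Spark actually picked.
    val (df1, df2) = sampleData(spark)
    val cond = df1("key") === df2("key")
    df1.join(broadcast(df2), cond).explain()
    df1.join(df2.hint("shuffle_hash"), cond).explain()
    df1.join(df2.hint("merge"), cond).explain()

    spark.stop()
  }
}
```

On this sample data the counts are 2 for `inner`, 3 for `left_outer`, 3 for `right_outer`, 4 for `full_outer`, 2 for `left_semi`, 1 for `left_anti`, and 9 for `cross`. Using per-join hints instead of `spark.conf.set` keeps the strategy choice local to one join rather than changing it for the whole session.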