A small note on this tweet from @KevinUshey and this tweet from @ChengHLee:
The number of rows, while important, is only one of the factors that influence the time taken to perform a join. In my benchmarking experience, two features influence join speed much more, especially with hash-table-based approaches (e.g. dplyr):
- The number of unique groups.
- The number of columns to join on. Note that this is related to the previous point: in most cases, the more columns, the more unique groups.
That is, these features influence join speed even when the number of rows stays the same.
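To make the two points concrete, here is a minimal sketch in pure Python. It is a toy hash join, not how dplyr or any other library actually implements its joins, and the helper names (`make_column`, `count_matches`, `demo`) are made up for illustration. It keeps the number of rows fixed, varies the number of unique keys, and then shows how joining on two columns instead of one increases the number of unique groups. It counts matching row pairs rather than materialising them, so the output size does not dominate the timing. Timings depend on the machine, so none are quoted here.

```python
import random
import time


def make_column(n_rows, n_values, seed):
    """Return n_rows integers drawn uniformly from n_values distinct values."""
    rng = random.Random(seed)
    return [rng.randrange(n_values) for _ in range(n_rows)]


def count_matches(left_keys, right_keys):
    """Toy hash join that counts matching row pairs instead of building them.

    Returns (number of matching pairs, number of entries in the hash table).
    """
    # Build phase: one hash table entry per distinct key in the right table.
    table = {}
    for key in right_keys:
        table[key] = table.get(key, 0) + 1
    # Probe phase: look up every left key in the hash table.
    matches = sum(table.get(key, 0) for key in left_keys)
    return matches, len(table)


def demo(n_rows=200_000):
    # Same number of rows each time; only the number of unique keys changes.
    for n_values in (10, 1_000, 100_000):
        left = make_column(n_rows, n_values, seed=1)
        right = make_column(n_rows, n_values, seed=2)
        start = time.perf_counter()
        _, table_size = count_matches(left, right)
        elapsed = time.perf_counter() - start
        print(f"{n_rows} rows, {table_size} unique keys: {elapsed:.4f}s")

    # Joining on two columns: each key is a tuple, so the number of
    # unique groups can grow up to the product of the per-column counts.
    a = make_column(n_rows, 100, seed=3)
    b = make_column(n_rows, 100, seed=4)
    print("unique groups on one column: ", len(set(a)))
    print("unique groups on two columns:", len(set(zip(a, b))))


if __name__ == "__main__":
    demo()
```

Because the keys are ordinary hashable values, passing `list(zip(a, b))` as the key list to `count_matches` joins on both columns at once; the hash table then holds one entry per unique combination rather than per unique value.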