@jrraymond
Created October 25, 2016 21:55
shuffle.partitions=200
collect: 5s, 2 stages, 220 tasks
runJob: 8s, 6 stages, 660 tasks
shuffle.partitions=60
collect: 5s, 2 stages, 80 tasks
runJob: 8s, 6 stages, 240 tasks
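
The only knob changed between the two runs above is spark.sql.shuffle.partitions (default 200 vs 60). A minimal Scala sketch of how that setting is applied; the SparkSession setup and the groupBy/collect workload here are illustrative assumptions, not the job measured above:

    import org.apache.spark.sql.SparkSession

    object ShufflePartitionsDemo {
      def main(args: Array[String]): Unit = {
        // 200 is the default; the second set of numbers above comes from lowering it to 60.
        val spark = SparkSession.builder()
          .appName("shuffle-partitions-demo")
          .config("spark.sql.shuffle.partitions", "60")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical stand-in for the measured workload: any query with a shuffle
        // (groupBy/join) runs shuffle.partitions tasks in its post-shuffle stage.
        val df = spark.range(0L, 10000000L).toDF("value")
        val agg = df.groupBy(($"value" % 1000).as("key")).count()
        agg.collect() // the post-shuffle stage now has 60 tasks instead of 200

        spark.stop()
      }
    }

The same value can also be passed at submit time with --conf spark.sql.shuffle.partitions=60.
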
collect:
stage 1 remains the same, 5s, 20 tasks, 1909.2MB input, 48.1MB shuffle write
stage 2 has fewer tasks: 0.3s -> 0.4s, 200 -> 60 tasks, 48.1MB shuffle read
the event timeline is more balanced, all tasks start at the same time
shuffle read size is much more balanced
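
One hypothetical way to confirm that balance from the driver (assuming a shuffled DataFrame named agg, e.g. the one in the sketch above) is to count rows per output partition and compare the spread:

    val perPartitionCounts = agg.rdd
      .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
      .collect()
    perPartitionCounts.foreach { case (idx, n) => println(s"partition $idx: $n rows") }
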
runRDD:
stages 1, 4, 5: no difference, since they already have fewer tasks (20) than shuffle partitions/cores (see the sketch after these notes)
stage 5: shuffle write 2042.2KB -> 1874.8KB
stage 4: shuffle write 2037.8KB -> 1872.0KB
stages 2,3,6 (200 tasks -> 60 tasks)
stage 2:
tasks start at the same time.
shuffle read size median (size / records): 10.1MB/502 -> 34.5MB/1706
shuffle write size median (size / records): 12.7KB/501 -> 25.8KB/1660
individual tasks take longer, but there are fewer of them and they all start at the same time, so it comes out in the wash
shuffle read constant at 2.0GB
shuffle write 2.5MB -> 1549.7KB
stage 3:
tasks don't start at the same time, but all start sooner than before
tasks take longer but start earlier, and there are fewer of them
input size median: 14.4MB/204 -> 47.8MB/12
shuffle write size median: 11.1KB/501 -> 18.5KB/1671
shuffle write 2.2MB -> 1105.2KB
stage 6:
shuffle read 8.5MB -> 6.3MB
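
To illustrate the stages-1/4/5 note above: the parallelism of a stage that reads input is driven by the input splits, not by spark.sql.shuffle.partitions, so those stages keep their 20 tasks regardless of the setting; only stages downstream of a shuffle pick up the configured partition count. A hypothetical check (the path and grouping key are made up, and the spark session and implicits from the first sketch are assumed):

    val input = spark.read.textFile("hdfs:///path/to/input")             // hypothetical path
    println(s"input partitions: ${input.rdd.getNumPartitions}")          // set by input splits, e.g. 20

    val shuffled = input.groupByKey(line => line.take(1)).count()
    println(s"post-shuffle partitions: ${shuffled.rdd.getNumPartitions}") // = spark.sql.shuffle.partitions
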