@igponce
Created June 24, 2015 21:20
Spark sorted vs unsorted joins timing...
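from time import time

# Assumes amazonInvPairsRDD and googleInvPairsRDD are pair RDDs defined earlier.

# Unsorted join: join on key, swap each (key, value) pair, reduce, and count.
inicio = time()
print(amazonInvPairsRDD
      .join(googleInvPairsRDD)
      .map(lambda kv: (kv[1], kv[0]))  # Python 3 has no tuple-parameter lambdas
      .reduceByKey(lambda tok1, tok2: [tok1, tok2])
      .count())
print("Unsorted: time = %.2f s" % (time() - inicio))
print("***********")

# Sort both RDDs by key.
# Note: sortByKey() is a transformation, so the sort itself is deferred until an
# action runs. This timer therefore does not capture the full sort cost, which is
# instead included in the sorted-join timing below.
inicio = time()
print("sorting...")
amzSorted = amazonInvPairsRDD.sortByKey()
gooSorted = googleInvPairsRDD.sortByKey()
print("Time to sort: %.2f s" % (time() - inicio))

# Same pipeline on the sorted RDDs.
inicio = time()
print(amzSorted
      .join(gooSorted)
      .map(lambda kv: (kv[1], kv[0]))
      .reduceByKey(lambda tok1, tok2: [tok1, tok2])
      .count())
print("Sorted: time = %.2f s" % (time() - inicio))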