@ftorto
Last active January 31, 2017 12:10
Spark Transformations examples

Map

>>> rdd = sc.parallelize([1,2,3])
>>> rdd.map(lambda x: [x,x*10])
RDD: [1,2,3] -> [[1,10], [2,20], [3,30]]

>>> rdd.flatMap(lambda x: [x,x*10])
RDD: [1,2,3] -> [1,10,2,20,3,30]
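
The difference between map and flatMap can be sketched without a Spark cluster. The snippet below is a plain-Python model of the semantics only, not Spark code: `data`, `mapped` and `flat_mapped` are illustrative names, and ordinary lists stand in for the RDD.

```python
from itertools import chain

# Stand-in for sc.parallelize([1,2,3]); a plain list, not an RDD.
data = [1, 2, 3]

# map: exactly one output element per input element.
# Here each output element happens to be a two-item list.
mapped = [[x, x * 10] for x in data]

# flatMap: apply the same function, then flatten the results by one level.
flat_mapped = list(chain.from_iterable([x, x * 10] for x in data))

print(mapped)       # [[1, 10], [2, 20], [3, 30]]
print(flat_mapped)  # [1, 10, 2, 20, 3, 30]
```

map keeps the element count at 3, while flatMap yields 6 elements because every returned list is unpacked into the result.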

Key-Value Transformation

Transformation      Type
reduceByKey(func)   func: (V,V) -> V
sortByKey()         (K,V) -> (sorted(K),V)
groupByKey()        (K,V) -> (K, iter(V))
>>> rdd = sc.parallelize([(1,2),(3,4),(1,32)])
>>> rdd.reduceByKey(lambda x,y: x+y)
((1,34),(3,4))
>>> rdd.sortByKey()
((1,2),(1,32),(3,4))
>>> rdd.groupByKey()
((1,[2,32]),(3,[4]))
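
The three key-value transformations above can be modelled in plain Python to make their types concrete. This is a sketch of the semantics only, not how Spark implements them: the helper names `group_by_key`, `reduce_by_key` and `sort_by_key` are hypothetical, and dicts and lists stand in for RDDs. Unlike this model, Spark makes no promise about the order of values within a group.

```python
from collections import defaultdict
from functools import reduce

# Stand-in for sc.parallelize([(1,2),(3,4),(1,32)]); a plain list, not an RDD.
pairs = [(1, 2), (3, 4), (1, 32)]

def group_by_key(pairs):
    """(K,V) -> (K, list of V): collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key(func, pairs):
    """Combine the values of each key pairwise with func: (V,V) -> V."""
    return {key: reduce(func, values)
            for key, values in group_by_key(pairs).items()}

def sort_by_key(pairs):
    """Order the pairs by key; the values are left untouched."""
    return sorted(pairs, key=lambda kv: kv[0])

grouped = group_by_key(pairs)
reduced = reduce_by_key(lambda x, y: x + y, pairs)
ordered = sort_by_key(pairs)

print(reduced)  # {1: 34, 3: 4}
print(ordered)  # [(1, 2), (1, 32), (3, 4)]
print(grouped)  # {1: [2, 32], 3: [4]}
```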

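RDD.toDebugString() returns a description of the RDD and its recursive dependencies (its lineage), which is useful for debugging.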
