Word count example in PySpark, used in the CloudCamp Chicago 2015.05.11 presentation.
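The word_count module itself isn't reproduced in this capture. A minimal sketch consistent with the session below (a reduceByKey shuffle, the sortByKey at word_count.py:11, and a final saveAsTextFile) might look roughly like this; the tokenization, lowercasing, and exact line layout are assumptions:

    # word_count.py (sketch, not the original gist file)
    import re

    def word_count(sc, input_path, output_path):
        counts = (
            sc.textFile(input_path)                          # one record per input line
              .flatMap(lambda line: re.split(r'\W+', line))  # split lines into words
              .filter(bool)                                  # drop empty tokens
              .map(lambda word: (word.lower(), 1))           # pair each word with 1
              .reduceByKey(lambda a, b: a + b)               # shuffle: sum counts per word
              .sortByKey()                                   # sort words alphabetically
        )
        counts.saveAsTextFile(output_path)                   # write one part file per partition
        return counts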
$ pyspark
Python 2.7.2 (default, Oct 11 2012, 20:14:37)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Python version 2.7.2 (default, Oct 11 2012 20:14:37)
SparkContext available as sc.
>>> from word_count import word_count
>>> word_count(sc, 'text.txt', 'text_counts')
15/05/11 10:01:28 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=278302556
15/05/11 10:01:28 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 265.3 MB)
15/05/11 10:01:29 INFO FileInputFormat: Total input paths to process : 1
15/05/11 10:01:29 INFO SparkContext: Starting job: sortByKey at word_count.py:11
15/05/11 10:01:29 INFO DAGScheduler: Registering RDD 4 (RDD at PythonRDD.scala:261)
15/05/11 10:01:29 INFO DAGScheduler: Got job 0 (sortByKey at word_count.py:11) with 2 output partitions (allowLocal=false)
15/05/11 10:01:29 INFO DAGScheduler: Final stage: Stage 0(sortByKey at word_count.py:11)
15/05/11 10:01:29 INFO DAGScheduler: Parents of final stage: List(Stage 1)
15/05/11 10:01:29 INFO DAGScheduler: Missing parents: List(Stage 1)
15/05/11 10:01:29 INFO DAGScheduler: Submitting Stage 1 (PairwiseRDD[4] at RDD at PythonRDD.scala:261), which has no missing parents
15/05/11 10:01:29 INFO MemoryStore: ensureFreeSpace(7776) called with curMem=163705, maxMem=278302556
15/05/11 10:01:29 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.6 KB, free 265.2 MB)
15/05/11 10:01:29 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (PairwiseRDD[4] at RDD at PythonRDD.scala:261)
15/05/11 10:01:29 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/05/11 10:01:29 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, PROCESS_LOCAL, 1204 bytes)
15/05/11 10:01:29 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1204 bytes)
15/05/11 10:01:29 INFO Executor: Running task 1.0 in stage 1.0 (TID 1)
15/05/11 10:01:29 INFO Executor: Running task 0.0 in stage 1.0 (TID 0)
15/05/11 10:01:29 INFO HadoopRDD: Input split: file:/tmp/cloudcamp/examples/text.txt:0+2644
15/05/11 10:01:29 INFO HadoopRDD: Input split: file:/tmp/cloudcamp/examples/text.txt:2644+2645
15/05/11 10:01:29 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/05/11 10:01:29 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/05/11 10:01:29 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/05/11 10:01:29 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/05/11 10:01:29 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/05/11 10:01:29 INFO PythonRDD: Times: total = 599, boot = 570, init = 29, finish = 0
15/05/11 10:01:30 INFO PythonRDD: Times: total = 610, boot = 567, init = 32, finish = 11
15/05/11 10:01:30 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 1863 bytes result sent to driver
15/05/11 10:01:30 INFO Executor: Finished task 1.0 in stage 1.0 (TID 1). 1863 bytes result sent to driver
15/05/11 10:01:30 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) in 805 ms on localhost (1/2)
15/05/11 10:01:30 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 833 ms on localhost (2/2)
15/05/11 10:01:30 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/05/11 10:01:30 INFO DAGScheduler: Stage 1 (RDD at PythonRDD.scala:261) finished in 0.849 s
15/05/11 10:01:30 INFO DAGScheduler: looking for newly runnable stages
15/05/11 10:01:30 INFO DAGScheduler: running: Set()
15/05/11 10:01:30 INFO DAGScheduler: waiting: Set(Stage 0)
15/05/11 10:01:30 INFO DAGScheduler: failed: Set()
15/05/11 10:01:30 INFO DAGScheduler: Missing parents for Stage 0: List()
15/05/11 10:01:30 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[8] at RDD at PythonRDD.scala:43), which is now runnable
15/05/11 10:01:30 INFO MemoryStore: ensureFreeSpace(5928) called with curMem=171481, maxMem=278302556
15/05/11 10:01:30 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 5.8 KB, free 265.2 MB)
15/05/11 10:01:30 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[8] at RDD at PythonRDD.scala:43)
15/05/11 10:01:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/05/11 10:01:30 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 948 bytes)
15/05/11 10:01:30 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 3, localhost, PROCESS_LOCAL, 948 bytes)
15/05/11 10:01:30 INFO Executor: Running task 1.0 in stage 0.0 (TID 3)
15/05/11 10:01:30 INFO Executor: Running task 0.0 in stage 0.0 (TID 2)
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 2 blocks
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 2 blocks
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 6 ms
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 6 ms
15/05/11 10:01:30 INFO PythonRDD: Times: total = 38, boot = 4, init = 34, finish = 0
15/05/11 10:01:30 INFO PythonRDD: Times: total = 39, boot = 7, init = 31, finish = 1
15/05/11 10:01:30 INFO Executor: Finished task 0.0 in stage 0.0 (TID 2). 862 bytes result sent to driver
15/05/11 10:01:30 INFO Executor: Finished task 1.0 in stage 0.0 (TID 3). 862 bytes result sent to driver
15/05/11 10:01:30 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 3) in 52 ms on localhost (1/2)
15/05/11 10:01:30 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 2) in 58 ms on localhost (2/2)
15/05/11 10:01:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/05/11 10:01:30 INFO DAGScheduler: Stage 0 (sortByKey at word_count.py:11) finished in 0.065 s
15/05/11 10:01:30 INFO SparkContext: Job finished: sortByKey at word_count.py:11, took 1.014226 s
15/05/11 10:01:30 INFO SparkContext: Starting job: sortByKey at word_count.py:11
15/05/11 10:01:30 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 146 bytes
15/05/11 10:01:30 INFO DAGScheduler: Got job 1 (sortByKey at word_count.py:11) with 2 output partitions (allowLocal=false)
15/05/11 10:01:30 INFO DAGScheduler: Final stage: Stage 2(sortByKey at word_count.py:11)
15/05/11 10:01:30 INFO DAGScheduler: Parents of final stage: List(Stage 3)
15/05/11 10:01:30 INFO DAGScheduler: Missing parents: List()
15/05/11 10:01:30 INFO DAGScheduler: Submitting Stage 2 (PythonRDD[9] at RDD at PythonRDD.scala:43), which has no missing parents
15/05/11 10:01:30 INFO MemoryStore: ensureFreeSpace(5968) called with curMem=177409, maxMem=278302556
15/05/11 10:01:30 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 5.8 KB, free 265.2 MB)
15/05/11 10:01:30 INFO DAGScheduler: Submitting 2 missing tasks from Stage 2 (PythonRDD[9] at RDD at PythonRDD.scala:43)
15/05/11 10:01:30 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
15/05/11 10:01:30 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4, localhost, PROCESS_LOCAL, 948 bytes)
15/05/11 10:01:30 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 5, localhost, PROCESS_LOCAL, 948 bytes)
15/05/11 10:01:30 INFO Executor: Running task 0.0 in stage 2.0 (TID 4)
15/05/11 10:01:30 INFO Executor: Running task 1.0 in stage 2.0 (TID 5)
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 2 blocks
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 2 blocks
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/05/11 10:01:30 INFO PythonRDD: Times: total = 16, boot = 3, init = 8, finish = 5
15/05/11 10:01:30 INFO PythonRDD: Times: total = 14, boot = 3, init = 8, finish = 3
15/05/11 10:01:30 INFO Executor: Finished task 0.0 in stage 2.0 (TID 4). 1202 bytes result sent to driver
15/05/11 10:01:30 INFO Executor: Finished task 1.0 in stage 2.0 (TID 5). 1184 bytes result sent to driver
15/05/11 10:01:30 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 30 ms on localhost (1/2)
15/05/11 10:01:30 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 5) in 33 ms on localhost (2/2)
15/05/11 10:01:30 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/05/11 10:01:30 INFO DAGScheduler: Stage 2 (sortByKey at word_count.py:11) finished in 0.041 s
15/05/11 10:01:30 INFO SparkContext: Job finished: sortByKey at word_count.py:11, took 0.054668 s
15/05/11 10:01:30 INFO SparkContext: Starting job: saveAsTextFile at NativeMethodAccessorImpl.java:-2
15/05/11 10:01:30 INFO DAGScheduler: Registering RDD 11 (RDD at PythonRDD.scala:261)
15/05/11 10:01:30 INFO DAGScheduler: Got job 2 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) with 2 output partitions (allowLocal=false)
15/05/11 10:01:30 INFO DAGScheduler: Final stage: Stage 4(saveAsTextFile at NativeMethodAccessorImpl.java:-2)
15/05/11 10:01:30 INFO DAGScheduler: Parents of final stage: List(Stage 6)
15/05/11 10:01:30 INFO DAGScheduler: Missing parents: List(Stage 6)
15/05/11 10:01:30 INFO DAGScheduler: Submitting Stage 6 (PairwiseRDD[11] at RDD at PythonRDD.scala:261), which has no missing parents
15/05/11 10:01:30 INFO MemoryStore: ensureFreeSpace(6840) called with curMem=183377, maxMem=278302556
15/05/11 10:01:30 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 6.7 KB, free 265.2 MB)
15/05/11 10:01:30 INFO DAGScheduler: Submitting 2 missing tasks from Stage 6 (PairwiseRDD[11] at RDD at PythonRDD.scala:261)
15/05/11 10:01:30 INFO TaskSchedulerImpl: Adding task set 6.0 with 2 tasks
15/05/11 10:01:30 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 6, localhost, PROCESS_LOCAL, 937 bytes)
15/05/11 10:01:30 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 7, localhost, PROCESS_LOCAL, 937 bytes)
15/05/11 10:01:30 INFO Executor: Running task 0.0 in stage 6.0 (TID 6)
15/05/11 10:01:30 INFO Executor: Running task 1.0 in stage 6.0 (TID 7)
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 2 blocks
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 1 ms
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 2 blocks
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 1 ms
15/05/11 10:01:30 INFO PythonRDD: Times: total = 12, boot = 4, init = 7, finish = 1
15/05/11 10:01:30 INFO PythonRDD: Times: total = 14, boot = 3, init = 9, finish = 2
15/05/11 10:01:30 INFO Executor: Finished task 1.0 in stage 6.0 (TID 7). 996 bytes result sent to driver
15/05/11 10:01:30 INFO Executor: Finished task 0.0 in stage 6.0 (TID 6). 996 bytes result sent to driver
15/05/11 10:01:30 INFO TaskSetManager: Finished task 1.0 in stage 6.0 (TID 7) in 38 ms on localhost (1/2)
15/05/11 10:01:30 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 6) in 45 ms on localhost (2/2)
15/05/11 10:01:30 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
15/05/11 10:01:30 INFO DAGScheduler: Stage 6 (RDD at PythonRDD.scala:261) finished in 0.050 s
15/05/11 10:01:30 INFO DAGScheduler: looking for newly runnable stages
15/05/11 10:01:30 INFO DAGScheduler: running: Set()
15/05/11 10:01:30 INFO DAGScheduler: waiting: Set(Stage 4)
15/05/11 10:01:30 INFO DAGScheduler: failed: Set()
15/05/11 10:01:30 INFO DAGScheduler: Missing parents for Stage 4: List()
15/05/11 10:01:30 INFO DAGScheduler: Submitting Stage 4 (MappedRDD[16] at saveAsTextFile at NativeMethodAccessorImpl.java:-2), which is now runnable
15/05/11 10:01:30 INFO MemoryStore: ensureFreeSpace(61344) called with curMem=190217, maxMem=278302556
15/05/11 10:01:30 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 59.9 KB, free 265.2 MB)
15/05/11 10:01:30 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (MappedRDD[16] at saveAsTextFile at NativeMethodAccessorImpl.java:-2)
15/05/11 10:01:30 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks
15/05/11 10:01:30 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 8, localhost, PROCESS_LOCAL, 948 bytes)
15/05/11 10:01:30 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 9, localhost, PROCESS_LOCAL, 948 bytes)
15/05/11 10:01:30 INFO Executor: Running task 0.0 in stage 4.0 (TID 8)
15/05/11 10:01:30 INFO Executor: Running task 1.0 in stage 4.0 (TID 9)
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
15/05/11 10:01:30 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/05/11 10:01:30 INFO PythonRDD: Times: total = 14, boot = 2, init = 10, finish = 2
15/05/11 10:01:30 INFO PythonRDD: Times: total = 16, boot = 4, init = 10, finish = 2
15/05/11 10:01:30 INFO FileOutputCommitter: Saved output of task 'attempt_201505111001_0004_m_000001_9' to file:/tmp/cloudcamp/examples/text_counts/_temporary/0/task_201505111001_0004_m_000001
15/05/11 10:01:30 INFO FileOutputCommitter: Saved output of task 'attempt_201505111001_0004_m_000000_8' to file:/tmp/cloudcamp/examples/text_counts/_temporary/0/task_201505111001_0004_m_000000
15/05/11 10:01:30 INFO SparkHadoopWriter: attempt_201505111001_0004_m_000001_9: Committed
15/05/11 10:01:30 INFO SparkHadoopWriter: attempt_201505111001_0004_m_000000_8: Committed
15/05/11 10:01:30 INFO Executor: Finished task 1.0 in stage 4.0 (TID 9). 846 bytes result sent to driver
15/05/11 10:01:30 INFO Executor: Finished task 0.0 in stage 4.0 (TID 8). 846 bytes result sent to driver
15/05/11 10:01:30 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 9) in 105 ms on localhost (1/2)
15/05/11 10:01:30 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 8) in 112 ms on localhost (2/2)
15/05/11 10:01:30 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
15/05/11 10:01:30 INFO DAGScheduler: Stage 4 (saveAsTextFile at NativeMethodAccessorImpl.java:-2) finished in 0.116 s
15/05/11 10:01:30 INFO SparkContext: Job finished: saveAsTextFile at NativeMethodAccessorImpl.java:-2, took 0.207841 s
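The two committed task attempts above correspond to the two output partitions, so the results end up as two plain-text part files under text_counts. Assuming the default Hadoop output committer (which also drops a _SUCCESS marker), inspecting the output would look something like this, with each line being the Python repr of a (word, count) pair:

    $ ls text_counts/
    _SUCCESS    part-00000    part-00001
    $ head text_counts/part-00000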