@emjayess
Last active August 29, 2015 14:14
Apache Spark & 'mcmath' NormTermOrder 10k
macarooni:geekout emjayess$ pyspark
Python 2.7.5 (default, Mar 9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/02/03 09:18:31 INFO SecurityManager: Changing view acls to: emjayess
15/02/03 09:18:31 INFO SecurityManager: Changing modify acls to: emjayess
15/02/03 09:18:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(emjayess); users with modify permissions: Set(emjayess)
15/02/03 09:18:32 INFO Slf4jLogger: Slf4jLogger started
15/02/03 09:18:32 INFO Remoting: Starting remoting
15/02/03 09:18:32 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.103:63274]
15/02/03 09:18:32 INFO Utils: Successfully started service 'sparkDriver' on port 63274.
15/02/03 09:18:32 INFO SparkEnv: Registering MapOutputTracker
15/02/03 09:18:32 INFO SparkEnv: Registering BlockManagerMaster
15/02/03 09:18:32 INFO DiskBlockManager: Created local directory at /var/folders/dn/4lp40f8d55d0l_glfztzq6wc0000gn/T/spark-local-20150203091832-1849
15/02/03 09:18:32 INFO MemoryStore: MemoryStore started with capacity 273.0 MB
2015-02-03 09:18:32.896 java[23456:c003] Unable to load realm info from SCDynamicStore
15/02/03 09:18:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/03 09:18:33 INFO HttpFileServer: HTTP File server directory is /var/folders/dn/4lp40f8d55d0l_glfztzq6wc0000gn/T/spark-6482de0d-6678-4cb2-bf33-b0752edf56f3
15/02/03 09:18:33 INFO HttpServer: Starting HTTP Server
15/02/03 09:18:33 INFO Utils: Successfully started service 'HTTP file server' on port 63275.
15/02/03 09:18:33 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/02/03 09:18:33 INFO SparkUI: Started SparkUI at http://192.168.1.103:4040
15/02/03 09:18:33 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.1.103:63274/user/HeartbeatReceiver
15/02/03 09:18:33 INFO NettyBlockTransferService: Server created on 63279
15/02/03 09:18:33 INFO BlockManagerMaster: Trying to register BlockManager
15/02/03 09:18:33 INFO BlockManagerMasterActor: Registering block manager localhost:63279 with 273.0 MB RAM, BlockManagerId(<driver>, localhost, 63279)
15/02/03 09:18:33 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/
Using Python version 2.7.5 (default, Mar 9 2014 22:15:05)
SparkContext available as sc.
>>>
>>> nto10k = sc.textFile("NormTermOrder10000.csv")
15/02/03 09:19:11 INFO MemoryStore: ensureFreeSpace(172851) called with curMem=0, maxMem=286300569
15/02/03 09:19:11 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 168.8 KB, free 272.9 MB)
15/02/03 09:19:11 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=172851, maxMem=286300569
15/02/03 09:19:11 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 272.9 MB)
15/02/03 09:19:11 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:63279 (size: 22.2 KB, free: 273.0 MB)
15/02/03 09:19:11 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/02/03 09:19:11 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
>>> nto10k.count()
15/02/03 09:19:30 INFO FileInputFormat: Total input paths to process : 1
15/02/03 09:19:30 INFO SparkContext: Starting job: count at <stdin>:1
15/02/03 09:19:30 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 2 output partitions (allowLocal=false)
15/02/03 09:19:30 INFO DAGScheduler: Final stage: Stage 0(count at <stdin>:1)
15/02/03 09:19:30 INFO DAGScheduler: Parents of final stage: List()
15/02/03 09:19:30 INFO DAGScheduler: Missing parents: List()
15/02/03 09:19:30 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at count at <stdin>:1), which has no missing parents
15/02/03 09:19:30 INFO MemoryStore: ensureFreeSpace(5488) called with curMem=195543, maxMem=286300569
15/02/03 09:19:30 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.4 KB, free 272.8 MB)
15/02/03 09:19:30 INFO MemoryStore: ensureFreeSpace(4090) called with curMem=201031, maxMem=286300569
15/02/03 09:19:30 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.0 KB, free 272.8 MB)
15/02/03 09:19:30 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:63279 (size: 4.0 KB, free: 273.0 MB)
15/02/03 09:19:30 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/02/03 09:19:30 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/02/03 09:19:30 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[2] at count at <stdin>:1)
15/02/03 09:19:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/02/03 09:19:30 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1346 bytes)
15/02/03 09:19:30 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1346 bytes)
15/02/03 09:19:30 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/02/03 09:19:30 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/02/03 09:19:31 INFO HadoopRDD: Input split: file:/Users/emjayess/Sources/geek_meet_code/mc.math/geekout/NormTermOrder10000.csv:0+34791
15/02/03 09:19:31 INFO HadoopRDD: Input split: file:/Users/emjayess/Sources/geek_meet_code/mc.math/geekout/NormTermOrder10000.csv:34791+34791
15/02/03 09:19:31 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/02/03 09:19:31 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/02/03 09:19:31 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/02/03 09:19:31 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/02/03 09:19:31 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/02/03 09:19:31 INFO PythonRDD: Times: total = 1116, boot = 939, init = 142, finish = 35
15/02/03 09:19:31 INFO PythonRDD: Times: total = 1120, boot = 943, init = 137, finish = 40
15/02/03 09:19:31 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1798 bytes result sent to driver
15/02/03 09:19:31 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1798 bytes result sent to driver
15/02/03 09:19:31 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1200 ms on localhost (1/2)
15/02/03 09:19:31 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1251 ms on localhost (2/2)
15/02/03 09:19:31 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/02/03 09:19:31 INFO DAGScheduler: Stage 0 (count at <stdin>:1) finished in 1.269 s
15/02/03 09:19:31 INFO DAGScheduler: Job 0 finished: count at <stdin>:1, took 1.349521 s
10200
>>> nto10k.filter(lambda line: "3" in line).count()
15/02/03 09:20:57 INFO SparkContext: Starting job: count at <stdin>:1
15/02/03 09:20:57 INFO DAGScheduler: Got job 1 (count at <stdin>:1) with 2 output partitions (allowLocal=false)
15/02/03 09:20:57 INFO DAGScheduler: Final stage: Stage 1(count at <stdin>:1)
15/02/03 09:20:57 INFO DAGScheduler: Parents of final stage: List()
15/02/03 09:20:57 INFO DAGScheduler: Missing parents: List()
15/02/03 09:20:57 INFO DAGScheduler: Submitting Stage 1 (PythonRDD[3] at count at <stdin>:1), which has no missing parents
15/02/03 09:20:57 INFO MemoryStore: ensureFreeSpace(5880) called with curMem=205121, maxMem=286300569
15/02/03 09:20:57 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 5.7 KB, free 272.8 MB)
15/02/03 09:20:57 INFO MemoryStore: ensureFreeSpace(4358) called with curMem=211001, maxMem=286300569
15/02/03 09:20:57 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.3 KB, free 272.8 MB)
15/02/03 09:20:57 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:63279 (size: 4.3 KB, free: 273.0 MB)
15/02/03 09:20:57 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/02/03 09:20:57 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
15/02/03 09:20:57 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (PythonRDD[3] at count at <stdin>:1)
15/02/03 09:20:57 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/02/03 09:20:57 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1346 bytes)
15/02/03 09:20:57 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1346 bytes)
15/02/03 09:20:57 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
15/02/03 09:20:57 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
15/02/03 09:20:57 INFO HadoopRDD: Input split: file:/Users/emjayess/Sources/geek_meet_code/mc.math/geekout/NormTermOrder10000.csv:0+34791
15/02/03 09:20:57 INFO HadoopRDD: Input split: file:/Users/emjayess/Sources/geek_meet_code/mc.math/geekout/NormTermOrder10000.csv:34791+34791
15/02/03 09:20:57 INFO PythonRDD: Times: total = 106, boot = 5, init = 48, finish = 53
15/02/03 09:20:57 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1798 bytes result sent to driver
15/02/03 09:20:57 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 118 ms on localhost (1/2)
15/02/03 09:20:57 INFO PythonRDD: Times: total = 112, boot = 3, init = 55, finish = 54
15/02/03 09:20:57 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1798 bytes result sent to driver
15/02/03 09:20:57 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 127 ms on localhost (2/2)
15/02/03 09:20:57 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/02/03 09:20:57 INFO DAGScheduler: Stage 1 (count at <stdin>:1) finished in 0.133 s
15/02/03 09:20:57 INFO DAGScheduler: Job 1 finished: count at <stdin>:1, took 0.145861 s
3477
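The session above loads a CSV as an RDD, counts its lines, then counts the lines containing "3". The same logic can be sketched in plain Python on a small hypothetical sample (the real file's contents are not shown in the transcript), to make explicit what the lambda passed to filter() does:

```python
# Hypothetical lines standing in for NormTermOrder10000.csv.
lines = ["1,0.25", "2,0.50", "3,0.75", "13,0.33", "44,0.10"]

# Equivalent of nto10k.count(): total number of lines.
total = len(lines)

# Equivalent of nto10k.filter(lambda line: "3" in line).count().
# Note this is a substring match on the raw line, so it also hits
# lines like "13,0.33" where "3" appears anywhere in the text.
with_three = sum(1 for line in lines if "3" in line)

print(total, with_three)
```

The substring semantics explain why 3,477 of the 10,200 lines match: any line whose text contains the character "3" anywhere is counted, not just lines whose first field equals 3.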
emjayess commented Feb 3, 2015

Quick Start with Spark
... including Python examples.
