@ssimeonov
Created July 2, 2015 19:36

Spark 1.4.0 regression: out-of-memory conditions on small data

A very simple Spark SQL COUNT(*) operation succeeds in spark-shell under 1.3.1 but fails with out-of-memory errors (PermGen space) under 1.4.0.

The data in question is a single gzipped file of 88,283 JSON objects with at most 109 fields per object. Size on disk is 181 MB.

This gist includes the repro code and the full spark-shell output from the 1.3.1 and 1.4.0 runs, including the command lines showing how spark-shell is started.

// Repro: load a gzipped JSON file through a HiveContext and count the rows.
import org.apache.spark.sql.hive.HiveContext

val ctx = new HiveContext(sc)
import ctx.implicits._

// Infer the schema from the JSON file (one object per line).
val df = ctx.jsonFile("file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz")
df.registerTempTable("training")

// The trivial aggregation that triggers the 1.4.0 failure.
val dfCount = ctx.sql("select count(*) as cnt from training")
println(dfCount.first.getLong(0))
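
Aside (not part of the original repro): jsonFile is deprecated in 1.4, which is why the 1.4.0 transcript below emits a deprecation warning. A minimal sketch of the non-deprecated equivalent, assuming the standard 1.4 DataFrameReader API:

// Assumed 1.4-style replacement for ctx.jsonFile; same path, same inferred schema.
val df = ctx.read.json("file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz")
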
➜ dev SPARK_REPL_OPTS="-XX:MaxPermSize=256m" spark-1.3.1-bin-hadoop2.6/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 --driver-memory 4g --executor-memory 4g
Ivy Default Cache set to: /Users/sim/.ivy2/cache
The jars for the packages stored in: /Users/sim/.ivy2/jars
:: loading settings :: url = jar:file:/Users/sim/dev/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.10;1.0.3 in central
found org.apache.commons#commons-csv;1.1 in central
:: resolution report :: resolve 195ms :: artifacts dl 5ms
:: modules in use:
com.databricks#spark-csv_2.10;1.0.3 from central in [default]
org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 2 already retrieved (0kB/5ms)
2015-07-02 15:29:33.242 java[45393:7905252] Unable to load realm info from SCDynamicStore
15/07/02 15:29:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/07/02 15:29:33 INFO spark.SecurityManager: Changing view acls to: sim
15/07/02 15:29:33 INFO spark.SecurityManager: Changing modify acls to: sim
15/07/02 15:29:33 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sim); users with modify permissions: Set(sim)
15/07/02 15:29:33 INFO spark.HttpServer: Starting HTTP Server
15/07/02 15:29:33 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/02 15:29:33 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:62083
15/07/02 15:29:33 INFO util.Utils: Successfully started service 'HTTP class server' on port 62083.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.1
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
15/07/02 15:29:36 INFO spark.SparkContext: Running Spark version 1.3.1
15/07/02 15:29:36 INFO spark.SecurityManager: Changing view acls to: sim
15/07/02 15:29:36 INFO spark.SecurityManager: Changing modify acls to: sim
15/07/02 15:29:36 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sim); users with modify permissions: Set(sim)
15/07/02 15:29:36 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/07/02 15:29:36 INFO Remoting: Starting remoting
15/07/02 15:29:36 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.12:62084]
15/07/02 15:29:36 INFO util.Utils: Successfully started service 'sparkDriver' on port 62084.
15/07/02 15:29:36 INFO spark.SparkEnv: Registering MapOutputTracker
15/07/02 15:29:36 INFO spark.SparkEnv: Registering BlockManagerMaster
15/07/02 15:29:36 INFO storage.DiskBlockManager: Created local directory at /var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-0de5dce8-23bf-4dab-849e-f3e55e083747/blockmgr-55d47ebf-9987-4f9b-ac3b-02537c0e86ba
15/07/02 15:29:36 INFO storage.MemoryStore: MemoryStore started with capacity 2.1 GB
15/07/02 15:29:36 INFO spark.HttpFileServer: HTTP File server directory is /var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-b8b6bbb8-13cc-4c7e-9696-2f3be90b54c6/httpd-9730dc96-ceb4-410a-aba0-967216cec688
15/07/02 15:29:36 INFO spark.HttpServer: Starting HTTP Server
15/07/02 15:29:36 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/02 15:29:36 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:62085
15/07/02 15:29:36 INFO util.Utils: Successfully started service 'HTTP file server' on port 62085.
15/07/02 15:29:36 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/07/02 15:29:36 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/02 15:29:36 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/07/02 15:29:36 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/07/02 15:29:36 INFO ui.SparkUI: Started SparkUI at http://192.168.1.12:4040
15/07/02 15:29:36 INFO spark.SparkContext: Added JAR file:/Users/sim/.ivy2/jars/spark-csv_2.10.jar at http://192.168.1.12:62085/jars/spark-csv_2.10.jar with timestamp 1435865376837
15/07/02 15:29:36 INFO spark.SparkContext: Added JAR file:/Users/sim/.ivy2/jars/commons-csv.jar at http://192.168.1.12:62085/jars/commons-csv.jar with timestamp 1435865376838
15/07/02 15:29:36 INFO executor.Executor: Starting executor ID <driver> on host localhost
15/07/02 15:29:36 INFO executor.Executor: Using REPL class URI: http://192.168.1.12:62083
15/07/02 15:29:36 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.1.12:62084/user/HeartbeatReceiver
15/07/02 15:29:36 INFO netty.NettyBlockTransferService: Server created on 62086
15/07/02 15:29:36 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/07/02 15:29:36 INFO storage.BlockManagerMasterActor: Registering block manager localhost:62086 with 2.1 GB RAM, BlockManagerId(<driver>, localhost, 62086)
15/07/02 15:29:36 INFO storage.BlockManagerMaster: Registered BlockManager
15/07/02 15:29:37 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.
15/07/02 15:29:37 INFO repl.SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala>
scala> val ctx = new HiveContext(sc)
ctx: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2e46890e
scala> import ctx.implicits._
import ctx.implicits._
scala>
scala> val df = ctx.jsonFile("file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz")
15/07/02 15:29:52 INFO storage.MemoryStore: ensureFreeSpace(183601) called with curMem=0, maxMem=2223023063
15/07/02 15:29:52 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 179.3 KB, free 2.1 GB)
15/07/02 15:29:52 INFO storage.MemoryStore: ensureFreeSpace(26218) called with curMem=183601, maxMem=2223023063
15/07/02 15:29:52 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 25.6 KB, free 2.1 GB)
15/07/02 15:29:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:62086 (size: 25.6 KB, free: 2.1 GB)
15/07/02 15:29:52 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
15/07/02 15:29:52 INFO spark.SparkContext: Created broadcast 0 from textFile at JSONRelation.scala:114
15/07/02 15:29:52 INFO mapred.FileInputFormat: Total input paths to process : 1
15/07/02 15:29:52 INFO spark.SparkContext: Starting job: isEmpty at JsonRDD.scala:51
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Got job 0 (isEmpty at JsonRDD.scala:51) with 1 output partitions (allowLocal=true)
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Final stage: Stage 0(isEmpty at JsonRDD.scala:51)
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Missing parents: List()
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Submitting Stage 0 (file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz MapPartitionsRDD[1] at textFile at JSONRelation.scala:114), which has no missing parents
15/07/02 15:29:52 INFO storage.MemoryStore: ensureFreeSpace(2728) called with curMem=209819, maxMem=2223023063
15/07/02 15:29:52 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.7 KB, free 2.1 GB)
15/07/02 15:29:52 INFO storage.MemoryStore: ensureFreeSpace(2031) called with curMem=212547, maxMem=2223023063
15/07/02 15:29:52 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2031.0 B, free 2.1 GB)
15/07/02 15:29:52 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:62086 (size: 2031.0 B, free: 2.1 GB)
15/07/02 15:29:52 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/07/02 15:29:52 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz MapPartitionsRDD[1] at textFile at JSONRelation.scala:114)
15/07/02 15:29:52 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/07/02 15:29:52 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1453 bytes)
15/07/02 15:29:52 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
15/07/02 15:29:52 INFO executor.Executor: Fetching http://192.168.1.12:62085/jars/commons-csv.jar with timestamp 1435865376838
15/07/02 15:29:52 INFO util.Utils: Fetching http://192.168.1.12:62085/jars/commons-csv.jar to /var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-fd72e62c-1adf-4bad-8c3d-5b3899545675/userFiles-c8f3949e-7f5e-43c3-b1ef-8f22523bdbcc/fetchFileTemp2464132617259806671.tmp
15/07/02 15:29:52 INFO executor.Executor: Adding file:/var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-fd72e62c-1adf-4bad-8c3d-5b3899545675/userFiles-c8f3949e-7f5e-43c3-b1ef-8f22523bdbcc/commons-csv.jar to class loader
15/07/02 15:29:52 INFO executor.Executor: Fetching http://192.168.1.12:62085/jars/spark-csv_2.10.jar with timestamp 1435865376837
15/07/02 15:29:52 INFO util.Utils: Fetching http://192.168.1.12:62085/jars/spark-csv_2.10.jar to /var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-fd72e62c-1adf-4bad-8c3d-5b3899545675/userFiles-c8f3949e-7f5e-43c3-b1ef-8f22523bdbcc/fetchFileTemp3554212928556694314.tmp
15/07/02 15:29:52 INFO executor.Executor: Adding file:/var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-fd72e62c-1adf-4bad-8c3d-5b3899545675/userFiles-c8f3949e-7f5e-43c3-b1ef-8f22523bdbcc/spark-csv_2.10.jar to class loader
15/07/02 15:29:52 INFO rdd.HadoopRDD: Input split: file:/Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz:0+22597095
15/07/02 15:29:52 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/07/02 15:29:52 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/07/02 15:29:52 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/07/02 15:29:52 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/07/02 15:29:52 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/07/02 15:29:52 INFO compress.CodecPool: Got brand-new decompressor [.gz]
15/07/02 15:29:52 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 3741 bytes result sent to driver
15/07/02 15:29:52 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 129 ms on localhost (1/1)
15/07/02 15:29:52 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Stage 0 (isEmpty at JsonRDD.scala:51) finished in 0.139 s
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Job 0 finished: isEmpty at JsonRDD.scala:51, took 0.172693 s
15/07/02 15:29:52 INFO spark.SparkContext: Starting job: reduce at JsonRDD.scala:54
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Got job 1 (reduce at JsonRDD.scala:54) with 1 output partitions (allowLocal=false)
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Final stage: Stage 1(reduce at JsonRDD.scala:54)
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Missing parents: List()
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[3] at map at JsonRDD.scala:54), which has no missing parents
15/07/02 15:29:52 INFO storage.MemoryStore: ensureFreeSpace(3240) called with curMem=214578, maxMem=2223023063
15/07/02 15:29:52 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.2 KB, free 2.1 GB)
15/07/02 15:29:52 INFO storage.MemoryStore: ensureFreeSpace(2338) called with curMem=217818, maxMem=2223023063
15/07/02 15:29:52 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.3 KB, free 2.1 GB)
15/07/02 15:29:52 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:62086 (size: 2.3 KB, free: 2.1 GB)
15/07/02 15:29:52 INFO storage.BlockManagerMaster: Updated info of block broadcast_2_piece0
15/07/02 15:29:52 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
15/07/02 15:29:52 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[3] at map at JsonRDD.scala:54)
15/07/02 15:29:52 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/07/02 15:29:52 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1453 bytes)
15/07/02 15:29:52 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
15/07/02 15:29:52 INFO rdd.HadoopRDD: Input split: file:/Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz:0+22597095
15/07/02 15:29:52 INFO compress.CodecPool: Got brand-new decompressor [.gz]
15/07/02 15:29:54 INFO storage.BlockManager: Removing broadcast 1
15/07/02 15:29:54 INFO storage.BlockManager: Removing block broadcast_1_piece0
15/07/02 15:29:54 INFO storage.MemoryStore: Block broadcast_1_piece0 of size 2031 dropped from memory (free 2222804938)
15/07/02 15:29:54 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:62086 in memory (size: 2031.0 B, free: 2.1 GB)
15/07/02 15:29:54 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/07/02 15:29:54 INFO storage.BlockManager: Removing block broadcast_1
15/07/02 15:29:54 INFO storage.MemoryStore: Block broadcast_1 of size 2728 dropped from memory (free 2222807666)
15/07/02 15:29:54 INFO spark.ContextCleaner: Cleaned broadcast 1
15/07/02 15:30:06 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 6638 bytes result sent to driver
15/07/02 15:30:06 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 13740 ms on localhost (1/1)
15/07/02 15:30:06 INFO scheduler.DAGScheduler: Stage 1 (reduce at JsonRDD.scala:54) finished in 13.744 s
15/07/02 15:30:06 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/07/02 15:30:06 INFO scheduler.DAGScheduler: Job 1 finished: reduce at JsonRDD.scala:54, took 13.753319 s
15/07/02 15:30:06 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/07/02 15:30:06 INFO metastore.ObjectStore: ObjectStore, initialize called
15/07/02 15:30:06 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/07/02 15:30:06 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/07/02 15:30:06 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/07/02 15:30:07 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/07/02 15:30:07 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/07/02 15:30:07 INFO metastore.MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: "@" (64), after : "".
15/07/02 15:30:08 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:30:08 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:30:08 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:30:08 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:30:08 INFO DataNucleus.Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/07/02 15:30:08 INFO metastore.ObjectStore: Initialized ObjectStore
15/07/02 15:30:08 INFO metastore.HiveMetaStore: Added admin role in metastore
15/07/02 15:30:08 INFO metastore.HiveMetaStore: Added public role in metastore
15/07/02 15:30:08 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
15/07/02 15:30:08 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/07/02 15:30:08 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
df: org.apache.spark.sql.DataFrame = [aac_brand: string, aag__id: bigint, aag_weight: bigint, aca_brand: string, aca_conversion_integration: boolean, aca_daily_budget: bigint, aca_hide_brand_from_publishers: boolean, aca_is_remnant: boolean, aca_short_name: string, accid: string, acr__id: bigint, acr_choices: array<struct<cta:string,headline:string,img:string,target:string>>, acr_cta: string, acr_description1: string, acr_description2: string, acr_destination: string, acr_displayUrl: string, acr_headline: string, acr_img: string, acr_isiUrl: string, acr_paramCTA: string, acr_paramName: string, acr_paramPlaceholder: string, acr_target: string, acr_type: string, acr_weight: bigint, agid: string, akw__id: bigint, akw_canonical_id: bigint, akw_criterion_type: string, akw_destination_url: st...
scala> df.registerTempTable("training")
scala>
scala> val dfCount = ctx.sql("select count(*) as cnt from training")
15/07/02 15:30:09 INFO parse.ParseDriver: Parsing command: select count(*) as cnt from training
15/07/02 15:30:09 INFO parse.ParseDriver: Parse Completed
dfCount: org.apache.spark.sql.DataFrame = [cnt: bigint]
scala> println(dfCount.first.getLong(0))
15/07/02 15:30:09 INFO storage.MemoryStore: ensureFreeSpace(90479) called with curMem=215397, maxMem=2223023063
15/07/02 15:30:09 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 88.4 KB, free 2.1 GB)
15/07/02 15:30:09 INFO storage.MemoryStore: ensureFreeSpace(36868) called with curMem=305876, maxMem=2223023063
15/07/02 15:30:09 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 36.0 KB, free 2.1 GB)
15/07/02 15:30:09 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:62086 (size: 36.0 KB, free: 2.1 GB)
15/07/02 15:30:09 INFO storage.BlockManagerMaster: Updated info of block broadcast_3_piece0
15/07/02 15:30:09 INFO spark.SparkContext: Created broadcast 3 from textFile at JSONRelation.scala:114
15/07/02 15:30:09 INFO spark.SparkContext: Starting job: runJob at SparkPlan.scala:122
15/07/02 15:30:09 INFO mapred.FileInputFormat: Total input paths to process : 1
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Registering RDD 10 (mapPartitions at Exchange.scala:101)
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Got job 2 (runJob at SparkPlan.scala:122) with 1 output partitions (allowLocal=false)
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Final stage: Stage 3(runJob at SparkPlan.scala:122)
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 2)
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Missing parents: List(Stage 2)
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Submitting Stage 2 (MapPartitionsRDD[10] at mapPartitions at Exchange.scala:101), which has no missing parents
15/07/02 15:30:09 INFO storage.MemoryStore: ensureFreeSpace(17448) called with curMem=342744, maxMem=2223023063
15/07/02 15:30:09 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 17.0 KB, free 2.1 GB)
15/07/02 15:30:09 INFO storage.MemoryStore: ensureFreeSpace(9310) called with curMem=360192, maxMem=2223023063
15/07/02 15:30:09 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 9.1 KB, free 2.1 GB)
15/07/02 15:30:09 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:62086 (size: 9.1 KB, free: 2.1 GB)
15/07/02 15:30:09 INFO storage.BlockManagerMaster: Updated info of block broadcast_4_piece0
15/07/02 15:30:09 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:839
15/07/02 15:30:09 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 2 (MapPartitionsRDD[10] at mapPartitions at Exchange.scala:101)
15/07/02 15:30:09 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
15/07/02 15:30:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1442 bytes)
15/07/02 15:30:09 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
15/07/02 15:30:09 INFO rdd.HadoopRDD: Input split: file:/Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz:0+22597095
15/07/02 15:30:09 INFO compress.CodecPool: Got brand-new decompressor [.gz]
15/07/02 15:30:15 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 2003 bytes result sent to driver
15/07/02 15:30:15 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 5081 ms on localhost (1/1)
15/07/02 15:30:15 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/07/02 15:30:15 INFO scheduler.DAGScheduler: Stage 2 (mapPartitions at Exchange.scala:101) finished in 5.081 s
15/07/02 15:30:15 INFO scheduler.DAGScheduler: looking for newly runnable stages
15/07/02 15:30:15 INFO scheduler.DAGScheduler: running: Set()
15/07/02 15:30:15 INFO scheduler.DAGScheduler: waiting: Set(Stage 3)
15/07/02 15:30:15 INFO scheduler.DAGScheduler: failed: Set()
15/07/02 15:30:15 INFO scheduler.DAGScheduler: Missing parents for Stage 3: List()
15/07/02 15:30:15 INFO scheduler.DAGScheduler: Submitting Stage 3 (MapPartitionsRDD[14] at map at SparkPlan.scala:97), which is now runnable
15/07/02 15:30:15 INFO storage.MemoryStore: ensureFreeSpace(18920) called with curMem=369502, maxMem=2223023063
15/07/02 15:30:15 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 18.5 KB, free 2.1 GB)
15/07/02 15:30:15 INFO storage.MemoryStore: ensureFreeSpace(10501) called with curMem=388422, maxMem=2223023063
15/07/02 15:30:15 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 10.3 KB, free 2.1 GB)
15/07/02 15:30:15 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:62086 (size: 10.3 KB, free: 2.1 GB)
15/07/02 15:30:15 INFO storage.BlockManagerMaster: Updated info of block broadcast_5_piece0
15/07/02 15:30:15 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:839
15/07/02 15:30:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 3 (MapPartitionsRDD[14] at map at SparkPlan.scala:97)
15/07/02 15:30:15 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
15/07/02 15:30:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, PROCESS_LOCAL, 1171 bytes)
15/07/02 15:30:15 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 3)
15/07/02 15:30:15 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/07/02 15:30:15 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
15/07/02 15:30:15 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 1115 bytes result sent to driver
15/07/02 15:30:15 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 52 ms on localhost (1/1)
15/07/02 15:30:15 INFO scheduler.DAGScheduler: Stage 3 (runJob at SparkPlan.scala:122) finished in 0.052 s
15/07/02 15:30:15 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
15/07/02 15:30:15 INFO scheduler.DAGScheduler: Job 2 finished: runJob at SparkPlan.scala:122, took 5.168569 s
88283
scala>
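
For comparison, the 1.3.1 run above completes the count in about five seconds and prints the expected 88,283. The identical session against 1.4.0 follows; it fails during the count with java.lang.OutOfMemoryError: PermGen space.
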
➜ dev SPARK_REPL_OPTS="-XX:MaxPermSize=256m" spark-1.4.0-bin-hadoop2.6/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 --driver-memory 4g --executor-memory 4g
Ivy Default Cache set to: /Users/sim/.ivy2/cache
The jars for the packages stored in: /Users/sim/.ivy2/jars
:: loading settings :: url = jar:file:/Users/sim/dev/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.10;1.0.3 in central
found org.apache.commons#commons-csv;1.1 in central
:: resolution report :: resolve 197ms :: artifacts dl 6ms
:: modules in use:
com.databricks#spark-csv_2.10;1.0.3 from central in [default]
org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 2 already retrieved (0kB/5ms)
2015-07-02 15:26:40.861 java[45131:7902534] Unable to load realm info from SCDynamicStore
15/07/02 15:26:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/07/02 15:26:41 INFO spark.SecurityManager: Changing view acls to: sim
15/07/02 15:26:41 INFO spark.SecurityManager: Changing modify acls to: sim
15/07/02 15:26:41 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sim); users with modify permissions: Set(sim)
15/07/02 15:26:41 INFO spark.HttpServer: Starting HTTP Server
15/07/02 15:26:41 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/02 15:26:41 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:62062
15/07/02 15:26:41 INFO util.Utils: Successfully started service 'HTTP class server' on port 62062.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
15/07/02 15:26:43 INFO spark.SparkContext: Running Spark version 1.4.0
15/07/02 15:26:43 INFO spark.SecurityManager: Changing view acls to: sim
15/07/02 15:26:43 INFO spark.SecurityManager: Changing modify acls to: sim
15/07/02 15:26:43 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sim); users with modify permissions: Set(sim)
15/07/02 15:26:44 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/07/02 15:26:44 INFO Remoting: Starting remoting
15/07/02 15:26:44 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.12:62063]
15/07/02 15:26:44 INFO util.Utils: Successfully started service 'sparkDriver' on port 62063.
15/07/02 15:26:44 INFO spark.SparkEnv: Registering MapOutputTracker
15/07/02 15:26:44 INFO spark.SparkEnv: Registering BlockManagerMaster
15/07/02 15:26:44 INFO storage.DiskBlockManager: Created local directory at /private/var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-9cfb3e45-eb00-4c59-87ff-964aa164cb70/blockmgr-4f947c98-c8c8-444b-8880-facb39129672
15/07/02 15:26:44 INFO storage.MemoryStore: MemoryStore started with capacity 2.1 GB
15/07/02 15:26:44 INFO spark.HttpFileServer: HTTP File server directory is /private/var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-9cfb3e45-eb00-4c59-87ff-964aa164cb70/httpd-19ef82f1-3f2d-47e2-a5fa-9fa2b7854062
15/07/02 15:26:44 INFO spark.HttpServer: Starting HTTP Server
15/07/02 15:26:44 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/02 15:26:44 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:62064
15/07/02 15:26:44 INFO util.Utils: Successfully started service 'HTTP file server' on port 62064.
15/07/02 15:26:44 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/07/02 15:26:44 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/02 15:26:44 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/07/02 15:26:44 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/07/02 15:26:44 INFO ui.SparkUI: Started SparkUI at http://192.168.1.12:4040
15/07/02 15:26:44 INFO spark.SparkContext: Added JAR file:/Users/sim/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jar at http://192.168.1.12:62064/jars/com.databricks_spark-csv_2.10-1.0.3.jar with timestamp 1435865204454
15/07/02 15:26:44 INFO spark.SparkContext: Added JAR file:/Users/sim/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar at http://192.168.1.12:62064/jars/org.apache.commons_commons-csv-1.1.jar with timestamp 1435865204455
15/07/02 15:26:44 INFO executor.Executor: Starting executor ID driver on host localhost
15/07/02 15:26:44 INFO executor.Executor: Using REPL class URI: http://192.168.1.12:62062
15/07/02 15:26:44 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 62065.
15/07/02 15:26:44 INFO netty.NettyBlockTransferService: Server created on 62065
15/07/02 15:26:44 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/07/02 15:26:44 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:62065 with 2.1 GB RAM, BlockManagerId(driver, localhost, 62065)
15/07/02 15:26:44 INFO storage.BlockManagerMaster: Registered BlockManager
15/07/02 15:26:44 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.
15/07/02 15:26:45 INFO hive.HiveContext: Initializing execution hive, version 0.13.1
15/07/02 15:26:45 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/07/02 15:26:45 INFO metastore.ObjectStore: ObjectStore, initialize called
15/07/02 15:26:45 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/07/02 15:26:45 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/07/02 15:26:45 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/07/02 15:26:45 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/07/02 15:26:46 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/07/02 15:26:46 INFO metastore.MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: "@" (64), after : "".
15/07/02 15:26:47 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:26:47 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:26:47 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:26:47 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:26:48 INFO metastore.ObjectStore: Initialized ObjectStore
15/07/02 15:26:48 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.13.1aa
15/07/02 15:26:48 INFO metastore.HiveMetaStore: Added admin role in metastore
15/07/02 15:26:48 INFO metastore.HiveMetaStore: Added public role in metastore
15/07/02 15:26:48 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
15/07/02 15:26:48 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/07/02 15:26:48 INFO repl.SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala>
scala> val ctx = new HiveContext(sc)
15/07/02 15:27:06 INFO hive.HiveContext: Initializing execution hive, version 0.13.1
ctx: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@6f57b5c0
scala> import ctx.implicits._
import ctx.implicits._
scala>
scala> val df = ctx.jsonFile("file:///Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
15/07/02 15:27:07 INFO storage.MemoryStore: ensureFreeSpace(89208) called with curMem=0, maxMem=2223023063
15/07/02 15:27:07 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 87.1 KB, free 2.1 GB)
15/07/02 15:27:07 INFO storage.MemoryStore: ensureFreeSpace(20184) called with curMem=89208, maxMem=2223023063
15/07/02 15:27:07 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.7 KB, free 2.1 GB)
15/07/02 15:27:07 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:62065 (size: 19.7 KB, free: 2.1 GB)
15/07/02 15:27:07 INFO spark.SparkContext: Created broadcast 0 from jsonFile at <console>:27
15/07/02 15:27:07 INFO mapred.FileInputFormat: Total input paths to process : 1
15/07/02 15:27:07 INFO spark.SparkContext: Starting job: jsonFile at <console>:27
15/07/02 15:27:07 INFO scheduler.DAGScheduler: Got job 0 (jsonFile at <console>:27) with 1 output partitions (allowLocal=false)
15/07/02 15:27:07 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(jsonFile at <console>:27)
15/07/02 15:27:07 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/07/02 15:27:07 INFO scheduler.DAGScheduler: Missing parents: List()
15/07/02 15:27:07 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at jsonFile at <console>:27), which has no missing parents
15/07/02 15:27:07 INFO storage.MemoryStore: ensureFreeSpace(4376) called with curMem=109392, maxMem=2223023063
15/07/02 15:27:07 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 2.1 GB)
15/07/02 15:27:07 INFO storage.MemoryStore: ensureFreeSpace(2438) called with curMem=113768, maxMem=2223023063
15/07/02 15:27:07 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 2.1 GB)
15/07/02 15:27:07 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:62065 (size: 2.4 KB, free: 2.1 GB)
15/07/02 15:27:07 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
15/07/02 15:27:07 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at jsonFile at <console>:27)
15/07/02 15:27:07 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/07/02 15:27:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1606 bytes)
15/07/02 15:27:07 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
15/07/02 15:27:07 INFO executor.Executor: Fetching http://192.168.1.12:62064/jars/com.databricks_spark-csv_2.10-1.0.3.jar with timestamp 1435865204454
15/07/02 15:27:07 INFO util.Utils: Fetching http://192.168.1.12:62064/jars/com.databricks_spark-csv_2.10-1.0.3.jar to /private/var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-9cfb3e45-eb00-4c59-87ff-964aa164cb70/userFiles-07da6cd8-f3ea-45c2-a20f-0a37b4815411/fetchFileTemp7375228441740076891.tmp
15/07/02 15:27:07 INFO executor.Executor: Adding file:/private/var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-9cfb3e45-eb00-4c59-87ff-964aa164cb70/userFiles-07da6cd8-f3ea-45c2-a20f-0a37b4815411/com.databricks_spark-csv_2.10-1.0.3.jar to class loader
15/07/02 15:27:07 INFO executor.Executor: Fetching http://192.168.1.12:62064/jars/org.apache.commons_commons-csv-1.1.jar with timestamp 1435865204455
15/07/02 15:27:07 INFO util.Utils: Fetching http://192.168.1.12:62064/jars/org.apache.commons_commons-csv-1.1.jar to /private/var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-9cfb3e45-eb00-4c59-87ff-964aa164cb70/userFiles-07da6cd8-f3ea-45c2-a20f-0a37b4815411/fetchFileTemp2509865384241222013.tmp
15/07/02 15:27:07 INFO executor.Executor: Adding file:/private/var/folders/ln/j4dkd3bd07d_7tzqc843y2jw0000gn/T/spark-9cfb3e45-eb00-4c59-87ff-964aa164cb70/userFiles-07da6cd8-f3ea-45c2-a20f-0a37b4815411/org.apache.commons_commons-csv-1.1.jar to class loader
15/07/02 15:27:07 INFO rdd.HadoopRDD: Input split: file:/Users/sim/dev/spx/data/view-clicks-training/2015/06/18/part-00000.gz:0+22597095
15/07/02 15:27:07 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/07/02 15:27:07 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/07/02 15:27:07 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/07/02 15:27:07 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/07/02 15:27:07 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/07/02 15:27:07 INFO compress.CodecPool: Got brand-new decompressor [.gz]
15/07/02 15:27:18 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 7650 bytes result sent to driver
15/07/02 15:27:18 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 11196 ms on localhost (1/1)
15/07/02 15:27:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/07/02 15:27:18 INFO scheduler.DAGScheduler: ResultStage 0 (jsonFile at <console>:27) finished in 11.204 s
15/07/02 15:27:18 INFO scheduler.DAGScheduler: Job 0 finished: jsonFile at <console>:27, took 11.242330 s
15/07/02 15:27:18 INFO hive.HiveContext: Initializing HiveMetastoreConnection version 0.13.1 using Spark classes.
15/07/02 15:27:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/07/02 15:27:19 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/07/02 15:27:19 INFO metastore.ObjectStore: ObjectStore, initialize called
15/07/02 15:27:19 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/07/02 15:27:19 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/07/02 15:27:19 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/07/02 15:27:19 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/07/02 15:27:19 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/07/02 15:27:19 INFO metastore.MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: "@" (64), after : "".
15/07/02 15:27:20 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:27:20 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:27:20 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:27:20 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
15/07/02 15:27:20 INFO DataNucleus.Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/07/02 15:27:20 INFO metastore.ObjectStore: Initialized ObjectStore
15/07/02 15:27:20 INFO metastore.HiveMetaStore: Added admin role in metastore
15/07/02 15:27:20 INFO metastore.HiveMetaStore: Added public role in metastore
15/07/02 15:27:20 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty
15/07/02 15:27:21 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
df: org.apache.spark.sql.DataFrame = [aac_brand: string, aag__id: bigint, aag_weight: bigint, aca_brand: string, aca_conversion_integration: boolean, aca_daily_budget: bigint, aca_hide_brand_from_publishers: boolean, aca_is_remnant: boolean, aca_short_name: string, accid: string, acr__id: bigint, acr_choices: array<struct<cta:string,headline:string,img:string,target:string>>, acr_cta: string, acr_description1: string, acr_description2: string, acr_destination: string, acr_displayUrl: string, acr_headline: string, acr_img: string, acr_isiUrl: string, acr_paramCTA: string, acr_paramName: string, acr_paramPlaceholder: string, acr_target: string, acr_type: string, acr_weight: bigint, agid: string, akw__id: bigint, akw_canonical_id: bigint, akw_criterion_type: string, akw_destination_url: st...
scala>
scala> df.registerTempTable("training")
scala>
scala> val dfCount = ctx.sql("select count(*) as cnt from training")
15/07/02 15:27:21 INFO parse.ParseDriver: Parsing command: select count(*) as cnt from training
15/07/02 15:27:22 INFO parse.ParseDriver: Parse Completed
15/07/02 15:27:22 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:62065 in memory (size: 2.4 KB, free: 2.1 GB)
15/07/02 15:27:22 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on localhost:62065 in memory (size: 19.7 KB, free: 2.1 GB)
dfCount: org.apache.spark.sql.DataFrame = [cnt: bigint]
scala>
scala> println(dfCount.first.getLong(0))
15/07/02 15:27:24 INFO storage.MemoryStore: ensureFreeSpace(235040) called with curMem=0, maxMem=2223023063
15/07/02 15:27:24 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 229.5 KB, free 2.1 GB)
15/07/02 15:27:24 INFO storage.MemoryStore: ensureFreeSpace(20184) called with curMem=235040, maxMem=2223023063
15/07/02 15:27:24 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.7 KB, free 2.1 GB)
15/07/02 15:27:24 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:62065 (size: 19.7 KB, free: 2.1 GB)
15/07/02 15:27:24 INFO spark.SparkContext: Created broadcast 2 from first at <console>:30
java.lang.OutOfMemoryError: PermGen space
15/07/02 15:27:52 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on localhost:62065 in memory (size: 19.7 KB, free: 2.1 GB)
Stopping spark context.
Exception in thread "main"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"
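
A plausible mitigation, offered as an assumption rather than a verified fix: the PermGen limit set via SPARK_REPL_OPTS may not be applied by the 1.4 launcher, so it could be worth passing the setting through spark-shell's own options instead, e.g.:

# Hypothetical invocation (untested here): raise driver PermGen via --driver-java-options.
spark-1.4.0-bin-hadoop2.6/bin/spark-shell \
  --packages com.databricks:spark-csv_2.10:1.0.3 \
  --driver-memory 4g --executor-memory 4g \
  --driver-java-options "-XX:MaxPermSize=512m"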