Skip to content

Instantly share code, notes, and snippets.

@reachbach
Created June 17, 2014 13:44
Show Gist options
  • Save reachbach/a418ab2f01b639b624c1 to your computer and use it in GitHub Desktop.
Save reachbach/a418ab2f01b639b624c1 to your computer and use it in GitHub Desktop.
Spark mllib LogisticRegressionWithSGD issues
Metric Min 25th Median 75th Max
Result serialization time 12 ms 13 ms 14 ms 16 ms 18 ms
Duration 4 s 4 s 5 s 5 s 5 s
Time spent fetching task 0 ms 0 ms 0 ms 0 ms 0 ms
results
Scheduler delay 6 s 6 s 6 s 6 s 12 s
Stage Id
14 aggregate at GradientDescent.scala:178
Task Index Task ID Status Locality Level Executor Launch Time Duration GC Result Ser Time Errors
Time
0 600 RUNNING PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 1.1 h
1 601 RUNNING PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 1.1 h
2 602 RUNNING PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 1.1 h
3 603 RUNNING PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 1.1 h
4 604 RUNNING PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 1.1 h
5 605 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 2 s 12 ms
6 606 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 1 s 14 ms
7 607 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 2 s 12 ms
8 608 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 15 ms
9 609 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 14 ms
10 610 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 15 ms
11 611 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 1 s 13 ms
12 612 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 18 ms
13 613 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 13 ms
14 614 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 1 s 14 ms
15 615 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 1 s 12 ms
16 616 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 15 ms
17 617 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 18 ms
18 618 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 16 ms
19 619 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 1 s 18 ms
Executor stats:
RDD Blocks Memory Used Disk Used Active Tasks Failed Tasks Complete Tasks Total Tasks Task Time Shuffle Read Shuffle Write
0 0.0 B / 6.7 GB 0.0 B 2 0 307 309 23.2 m 0.0 B 0.0 B
0 0.0 B / 6.7 GB 0.0 B 3 0 308 311 22.4 m 0.0 B 0.0 B
Executor jmap output:
Server compiler detected.
JVM version is 24.55-b03
using thread-local object allocation.
Parallel GC with 18 thread(s)
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 10737418240 (10240.0MB)
NewSize = 1310720 (1.25MB)
MaxNewSize = 17592186044415 MB
OldSize = 5439488 (5.1875MB)
NewRatio = 2
SurvivorRatio = 8
PermSize = 21757952 (20.75MB)
MaxPermSize = 134217728 (128.0MB)
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
PS Young Generation
Eden Space:
capacity = 2783969280 (2655.0MB)
used = 192583816 (183.66223907470703MB)
free = 2591385464 (2471.337760925293MB)
6.917598458557704% used
From Space:
capacity = 409993216 (391.0MB)
used = 1179808 (1.125152587890625MB)
free = 408813408 (389.8748474121094MB)
0.2877628102022059% used
To Space:
capacity = 385351680 (367.5MB)
used = 0 (0.0MB)
free = 385351680 (367.5MB)
0.0% used
PS Old Generation
capacity = 7158628352 (6827.0MB)
used = 4455093024 (4248.707794189453MB)
free = 2703535328 (2578.292205810547MB)
62.2338918146983% used
PS Perm Generation
capacity = 90701824 (86.5MB)
used = 45348832 (43.248016357421875MB)
free = 45352992 (43.251983642578125MB)
49.99770677158598% used
8432 interned Strings occupying 714672 bytes.
Executor GC log snippet:
168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)] 9302453K->7460857K(9907712K), 0.3193550 secs] [Times: user=5.13 sys=0.39, real=0.32 secs]
169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)] [ParOldGen: 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K) [PSPermGen: 44248K->44201K(88576K)], 4.5521090 secs] [Times: user=24.22 sys=0.18, real=4.55 secs]
174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)] 3411653K->1154665K(9534976K), 0.0966280 secs] [Times: user=1.66 sys=0.00, real=0.09 secs]
I tried to map partitions to cores on the nodes. Increasing the number of partitions (say to 80 or 100) would result in progress till the 6th iteration or so, but the next stage would stall as before with apparent root cause / logs. With increased partitions, the last stage that completed had the following task times:
Metric Min 25th Median 75th Max
Result serialization time 11 ms 12 ms 13 ms 15 ms 0.4 s
Duration 0.5 s 0.9 s 1 s 3 s 7 s
Time spent fetching 0 ms 0 ms 0 ms 0 ms 0 ms
task results
Scheduler delay 5 s 6 s 6 s 7 s 12 s
I also tried reducing the number of records by aggregating on (x,y) as the key (i.e. using aggregations instead of training on every raw record), but encountered by the following exception:
Loss was due to java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment