reachbach/description.txt

## description.txt
Metric						Min		25th 	Median	75th 	Max
Result serialization time	12 ms	13 ms	14 ms	16 ms	18 ms
Duration					4 s		4 s		5 s		5 s		5 s
Time spent fetching task 	0 ms	0 ms	0 ms	0 ms	0 ms
results
Scheduler delay				6 s		6 s		6 s		6 s		12 s


Stage Id
14 aggregate at GradientDescent.scala:178

Task Index	Task ID	Status	Locality Level	 Executor	            Launch Time	            Duration	GC 	Result Ser Time	Errors
																										Time

0 	600 	RUNNING 	PROCESS_LOCAL 	serious.dataone.foo.bar.com 	2014/06/17 10:32:27 	1.1 h
1 	601 	RUNNING 	PROCESS_LOCAL 	casual.dataone.foo.bar.com 	2014/06/17 10:32:27 		1.1 h
2 	602 	RUNNING 	PROCESS_LOCAL 	serious.dataone.foo.bar.com 	2014/06/17 10:32:27 	1.1 h
3 	603 	RUNNING 	PROCESS_LOCAL 	casual.dataone.foo.bar.com 	2014/06/17 10:32:27 		1.1 h
4 	604 	RUNNING 	PROCESS_LOCAL 	serious.dataone.foo.bar.com 	2014/06/17 10:32:27 	1.1 h
5 	605 	SUCCESS 	PROCESS_LOCAL 	casual.dataone.foo.bar.com 	2014/06/17 10:32:27 4 s 	2 s 	12 ms
6 	606 	SUCCESS 	PROCESS_LOCAL 	serious.dataone.foo.bar.com 	2014/06/17 10:32:27 	4 s 	1 s 	14 ms
7 	607 	SUCCESS 	PROCESS_LOCAL 	casual.dataone.foo.bar.com 	2014/06/17 10:32:27 4 s 	2 s 	12 ms
8 	608 	SUCCESS 	PROCESS_LOCAL 	serious.dataone.foo.bar.com 	2014/06/17 10:32:27 	5 s 	1 s 	15 ms
9 	609 	SUCCESS 	PROCESS_LOCAL 	casual.dataone.foo.bar.com 	2014/06/17 10:32:27 5 s 	1 s 	14 ms
10 	610 	SUCCESS 	PROCESS_LOCAL 	serious.dataone.foo.bar.com 	2014/06/17 10:32:27 	5 s 	1 s 	15 ms
11 	611 	SUCCESS 	PROCESS_LOCAL 	casual.dataone.foo.bar.com 	2014/06/17 10:32:27 4 s 	1 s 	13 ms
12 	612 	SUCCESS 	PROCESS_LOCAL 	serious.dataone.foo.bar.com 	2014/06/17 10:32:27 	5 s 	1 s 	18 ms
13 	613 	SUCCESS 	PROCESS_LOCAL 	casual.dataone.foo.bar.com 	2014/06/17 10:32:27 5 s 	1 s 	13 ms
14 	614 	SUCCESS 	PROCESS_LOCAL 	serious.dataone.foo.bar.com 	2014/06/17 10:32:27 	4 s 	1 s 	14 ms
15 	615 	SUCCESS 	PROCESS_LOCAL 	casual.dataone.foo.bar.com 	2014/06/17 10:32:27 4 s 	1 s 	12 ms
16 	616 	SUCCESS 	PROCESS_LOCAL 	serious.dataone.foo.bar.com 	2014/06/17 10:32:27 	5 s 	1 s 	15 ms
17 	617 	SUCCESS 	PROCESS_LOCAL 	casual.dataone.foo.bar.com 	2014/06/17 10:32:27 5 s 	1 s 	18 ms
18 	618 	SUCCESS 	PROCESS_LOCAL 	serious.dataone.foo.bar.com 	2014/06/17 10:32:27 	5 s 	1 s 	16 ms
19 	619 	SUCCESS 	PROCESS_LOCAL 	casual.dataone.foo.bar.com 	2014/06/17 10:32:27 4 s 	1 s 	18 ms

Executor stats:

RDD Blocks	Memory Used	Disk Used	Active Tasks	Failed Tasks	Complete Tasks	Total Tasks	Task Time	Shuffle Read	Shuffle Write
0 	0.0 B / 6.7 GB 		0.0 B 		2			 	0 					307 		309 		23.2 m 		0.0 B 			0.0 B
0 	0.0 B / 6.7 GB 		0.0 B 		3			 	0 					308 		311 		22.4 m 		0.0 B 			0.0 B


Executor jmap output:

Server compiler detected.
JVM version is 24.55-b03

using thread-local object allocation.
Parallel GC with 18 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 10737418240 (10240.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 134217728 (128.0MB)
   G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 2783969280 (2655.0MB)
   used     = 192583816 (183.66223907470703MB)
   free     = 2591385464 (2471.337760925293MB)
   6.917598458557704% used
From Space:
   capacity = 409993216 (391.0MB)
   used     = 1179808 (1.125152587890625MB)
   free     = 408813408 (389.8748474121094MB)
   0.2877628102022059% used
To Space:
   capacity = 385351680 (367.5MB)
   used     = 0 (0.0MB)
   free     = 385351680 (367.5MB)
   0.0% used
PS Old Generation
   capacity = 7158628352 (6827.0MB)
   used     = 4455093024 (4248.707794189453MB)
   free     = 2703535328 (2578.292205810547MB)
   62.2338918146983% used
PS Perm Generation
   capacity = 90701824 (86.5MB)
   used     = 45348832 (43.248016357421875MB)
   free     = 45352992 (43.251983642578125MB)
   49.99770677158598% used

8432 interned Strings occupying 714672 bytes.


Executor GC log snippet:

168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)] 9302453K->7460857K(9907712K), 0.3193550 secs] [Times: user=5.13 sys=0.39, real=0.32 secs]
169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)] [ParOldGen: 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K) [PSPermGen: 44248K->44201K(88576K)], 4.5521090 secs] [Times: user=24.22 sys=0.18, real=4.55 secs]
174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)] 3411653K->1154665K(9534976K), 0.0966280 secs] [Times: user=1.66 sys=0.00, real=0.09 secs]

I tried to map partitions to cores on the nodes. Increasing the number of partitions (say to 80 or 100) would result in progress till the 6th iteration or so, but the next stage would stall as before with apparent root cause / logs. With increased partitions, the last stage that completed had the following task times:

Metric						Min		25th 	Median	75th	Max
Result serialization time	11 ms	12 ms	13 ms	15 ms	0.4 s
Duration					0.5 s	0.9 s	1 s		3 s		7 s
Time spent fetching			0 ms	0 ms	0 ms	0 ms	0 ms
task results
Scheduler delay				5 s		6 s		6 s		7 s		12 s

I also tried reducing the number of records by aggregating on (x,y) as the key (i.e. using aggregations instead of training on every raw record), but encountered by the following exception:

Loss was due to java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
        at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
        at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
	Metric Min 25th Median 75th Max
	Result serialization time 12 ms 13 ms 14 ms 16 ms 18 ms
	Duration 4 s 4 s 5 s 5 s 5 s
	Time spent fetching task 0 ms 0 ms 0 ms 0 ms 0 ms
	results
	Scheduler delay 6 s 6 s 6 s 6 s 12 s


	Stage Id
	14 aggregate at GradientDescent.scala:178

	Task Index Task ID Status Locality Level Executor Launch Time Duration GC Result Ser Time Errors
	Time

	0 600 RUNNING PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 1.1 h
	1 601 RUNNING PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 1.1 h
	2 602 RUNNING PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 1.1 h
	3 603 RUNNING PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 1.1 h
	4 604 RUNNING PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 1.1 h
	5 605 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 2 s 12 ms
	6 606 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 1 s 14 ms
	7 607 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 2 s 12 ms
	8 608 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 15 ms
	9 609 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 14 ms
	10 610 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 15 ms
	11 611 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 1 s 13 ms
	12 612 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 18 ms
	13 613 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 13 ms
	14 614 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 1 s 14 ms
	15 615 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 1 s 12 ms
	16 616 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 15 ms
	17 617 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 18 ms
	18 618 SUCCESS PROCESS_LOCAL serious.dataone.foo.bar.com 2014/06/17 10:32:27 5 s 1 s 16 ms
	19 619 SUCCESS PROCESS_LOCAL casual.dataone.foo.bar.com 2014/06/17 10:32:27 4 s 1 s 18 ms

	Executor stats:

	RDD Blocks Memory Used Disk Used Active Tasks Failed Tasks Complete Tasks Total Tasks Task Time Shuffle Read Shuffle Write
	0 0.0 B / 6.7 GB 0.0 B 2 0 307 309 23.2 m 0.0 B 0.0 B
	0 0.0 B / 6.7 GB 0.0 B 3 0 308 311 22.4 m 0.0 B 0.0 B


	Executor jmap output:

	Server compiler detected.
	JVM version is 24.55-b03

	using thread-local object allocation.
	Parallel GC with 18 thread(s)

	Heap Configuration:
	MinHeapFreeRatio = 40
	MaxHeapFreeRatio = 70
	MaxHeapSize = 10737418240 (10240.0MB)
	NewSize = 1310720 (1.25MB)
	MaxNewSize = 17592186044415 MB
	OldSize = 5439488 (5.1875MB)
	NewRatio = 2
	SurvivorRatio = 8
	PermSize = 21757952 (20.75MB)
	MaxPermSize = 134217728 (128.0MB)
	G1HeapRegionSize = 0 (0.0MB)

	Heap Usage:
	PS Young Generation
	Eden Space:
	capacity = 2783969280 (2655.0MB)
	used = 192583816 (183.66223907470703MB)
	free = 2591385464 (2471.337760925293MB)
	6.917598458557704% used
	From Space:
	capacity = 409993216 (391.0MB)
	used = 1179808 (1.125152587890625MB)
	free = 408813408 (389.8748474121094MB)
	0.2877628102022059% used
	To Space:
	capacity = 385351680 (367.5MB)
	used = 0 (0.0MB)
	free = 385351680 (367.5MB)
	0.0% used
	PS Old Generation
	capacity = 7158628352 (6827.0MB)
	used = 4455093024 (4248.707794189453MB)
	free = 2703535328 (2578.292205810547MB)
	62.2338918146983% used
	PS Perm Generation
	capacity = 90701824 (86.5MB)
	used = 45348832 (43.248016357421875MB)
	free = 45352992 (43.251983642578125MB)
	49.99770677158598% used

	8432 interned Strings occupying 714672 bytes.


	Executor GC log snippet:

	168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)] 9302453K->7460857K(9907712K), 0.3193550 secs] [Times: user=5.13 sys=0.39, real=0.32 secs]
	169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)] [ParOldGen: 6882312K->1073297K(6990848K)] 7460857K->1073297K(9907712K) [PSPermGen: 44248K->44201K(88576K)], 4.5521090 secs] [Times: user=24.22 sys=0.18, real=4.55 secs]
	174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)] 3411653K->1154665K(9534976K), 0.0966280 secs] [Times: user=1.66 sys=0.00, real=0.09 secs]

	I tried to map partitions to cores on the nodes. Increasing the number of partitions (say to 80 or 100) would result in progress till the 6th iteration or so, but the next stage would stall as before with apparent root cause / logs. With increased partitions, the last stage that completed had the following task times:

	Metric Min 25th Median 75th Max
	Result serialization time 11 ms 12 ms 13 ms 15 ms 0.4 s
	Duration 0.5 s 0.9 s 1 s 3 s 7 s
	Time spent fetching 0 ms 0 ms 0 ms 0 ms 0 ms
	task results
	Scheduler delay 5 s 6 s 6 s 7 s 12 s

	I also tried reducing the number of records by aggregating on (x,y) as the key (i.e. using aggregations instead of training on every raw record), but encountered by the following exception:

	Loss was due to java.lang.NullPointerException
	java.lang.NullPointerException
	at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
	at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
	at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
	at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.Task.run(Task.scala:51)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)