Spark Memory Profile in Amazon EMR
  1. How to log into slaves

One can log into the slave nodes using SSH agent forwarding. Suppose I find the following in the YARN log:

java.nio.channels.CancelledKeyException
14/09/25 23:12:58 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(ip-172-31-xx-xx.ec2.internal,49972)
java.io.IOException: Connection reset by peer
...

I would like to SSH to the slave ip-172-31-xx-xx.ec2.internal to check its logs. To do so, first we add the EMR identity (the EC2 key pair) to the SSH agent:

you@localhost: ssh-add ~/EC2_KEY.pem 
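
To verify that the key was added, we can list the identities currently held by the agent:

you@localhost: ssh-add -l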

and then SSH to the master using agent forwarding:

you@localhost: ssh -A hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com -i ~/EC2_KEY.pem

and then we are able to log into the slave node:

hadoop@master: ssh hadoop@ip-172-31-xx-xx.ec2.internal
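
Once on the slave, the NodeManager and container logs on this EMR setup live under /mnt/var/log/hadoop/ (the same paths that appear in the later sections), so a quick sanity check is:

hadoop@internal: ls /mnt/var/log/hadoop/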
  2. Find the lost block manager/container

Sometimes we find that something has gone wrong and the job ends up with a FetchFailed error:

FetchFailed(BlockManagerId(81, ip-172-31-xx-xx.ec2.internal, 52487, 0), shuffleId=83, mapId=44, reduceId=49)

First we check the application master in the ResourceManager web UI:

http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9026/cluster/app/application_1413826885383_0001

and then click on Nodes:

http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9026/cluster/nodes

Find ip-172-31-xx-xx.ec2.internal in the node list and click it to open its NodeManager UI:

http://ip-172-31-xx-xx.ec2.internal:9035/

and in the List of Containers we can see that no container is using BlockManagerId(81, ip-172-31-xx-xx.ec2.internal, 52487, 0). This tells us that our block manager is gone. Now we want to find the container that was using the killed block manager.

Go to Tools > Local logs on the left:

http://ip-172-31-xx-xx.ec2.internal:9035/logs/userlogs/application_1413826885383_0001/

We check each of the container folders and find the one that used the block manager, say container_1413826885383_0001_01_000086. The block manager shows up in its stderr log:

http://ip-172-31-xx-xx.ec2.internal:9035/logs/userlogs/application_1413826885383_0001/container_1413826885383_0001_01_000086/stderr
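
Checking every container folder by hand can be slow. Since we can SSH into the slave, one shortcut (a sketch, assuming the container logs live under /mnt/var/log/hadoop/userlogs/ as in the process listing shown later) is to grep them for the block manager's port, 52487:

hadoop@internal: grep -rl 52487 /mnt/var/log/hadoop/userlogs/application_1413826885383_0001/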

Now we would like to know why the container container_1413826885383_0001_01_000086 was killed. To do so we investigate the NodeManager log file at

http://ip-172-31-xx-xx.ec2.internal:9035/logs/yarn-hadoop-nodemanager-ip-172-31-xx-xx.ec2.internal.log.2014-10-20-18

And we find that the container was killed because of some problem, e.g., the physical memory limit was exceeded:

Limit=11274289152, current usage = 11443380224
2014-10-20 18:00:20,624 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Container [pid=3377,containerID=container_1413826885383_0001_01_000086] is running beyond physical memory limits. Current usage: 10.7 GB of 10.5 GB physical memory used; 11.3 GB of 52.5 GB virtual memory used. Killing container.
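
On a long-running cluster the NodeManager log rotates, so it can be convenient to grep all of its rotated files for the kill message instead of browsing them one by one:

hadoop@internal: grep "Killing container" /mnt/var/log/hadoop/yarn-hadoop-nodemanager-ip-172-31-xx-xx.ec2.internal.log*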
  3. Profile a live container

Once we have identified a container that causes the problem, we can dump the memory of the process that is running it. After we log into the internal node, we can check the logs from the NodeManager, e.g.,

hadoop@internal:tail /mnt/var/log/hadoop/yarn-hadoop-nodemanager-ip-172-31-xx-xx.ec2.internal.log

where we can find the status and memory usage of each container. For example, for the container container_1414015957471_0001_01_000015:

2014-10-22 22:22:26,574 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 3929 for container-id container_1414015957471_0001_01_000015: 7.3 GB of 20.5 GB physical memory used; 21.5 GB of 102.5 GB virtual memory used
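
To watch how a single container's memory usage evolves over time, a minimal sketch is to follow the same NodeManager log and filter for that container id:

hadoop@internal:tail -f /mnt/var/log/hadoop/yarn-hadoop-nodemanager-ip-172-31-xx-xx.ec2.internal.log | grep container_1414015957471_0001_01_000015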

Suppose we want to analyze the memory usage of this container. The first thing we need is the id of the system process that is running it. To find it, we search the process list:

hadoop@internal:ps aux | grep container_1414015957471_0001_01_000015

and we find the following:

hadoop    3929  0.0  0.0 112916  1216 ?        Ss   22:14   0:00 /bin/bash -c /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms20480m -Xmx20480m  -Djava.io.tmpdir=/mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1414015957471_0001/container_1414015957471_0001_01_000015/tmp '-Dspark.akka.timeout=200' '-Dspark.akka.frameSize=100' org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@ip-172-31-32-201.ec2.internal:37254/user/CoarseGrainedScheduler 25 ip-172-31-32-193.ec2.internal 6 1> /mnt/var/log/hadoop/userlogs/application_1414015957471_0001/container_1414015957471_0001_01_000015/stdout 2> /mnt/var/log/hadoop/userlogs/application_1414015957471_0001/container_1414015957471_0001_01_000015/stderr

hadoop    3940 58.6 14.4 22484956 8887668 ?    Sl   22:14   5:59 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms20480m -Xmx20480m -Djava.io.tmpdir=/mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1414015957471_0001/container_1414015957471_0001_01_000015/tmp -Dspark.akka.timeout=200 -Dspark.akka.frameSize=100 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@ip-172-31-32-201.ec2.internal:37254/user/CoarseGrainedScheduler 25 ip-172-31-32-193.ec2.internal 6

hadoop    8006  0.0  0.0 110280   860 pts/0    S+   22:24   0:00 grep container_1414015957471_0001_01_000015

where the first entry is the bash wrapper script and the last one is the grep we just ran. Thus the pid of the JVM running the container is 3940. We can now use the standard Java utilities, e.g., jstack to inspect the current thread stacks:

hadoop@internal:jstack 3940 | vim -
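
For a quick view of which classes dominate the heap without taking a full dump, we can also print a live object histogram (the head is only there to keep the output short):

hadoop@internal:jmap -histo:live 3940 | head -n 30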

or we can dump the current heap contents to a file:

hadoop@internal:jmap -dump:format=b,file=/mnt/3940.dump 3940

and use jhat or VisualVM to analyze the heap dump. For example, run the following command:

hadoop@internal:jhat /mnt/3940.dump
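
Note that jhat itself needs a sizeable Java heap to parse a large dump; if it runs out of memory, JVM options can be passed through -J (the 4g value here is just an example):

hadoop@internal:jhat -J-Xmx4g /mnt/3940.dump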

jhat will analyze the dump file and start a web server on port 7000. Through a SOCKS connection we can then access the web page at

http://ip-172-31-xx-xx.ec2.internal:7000/

and show instance counts:

http://ip-172-31-xx-xx.ec2.internal:7000/showInstanceCounts/
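
To reach these internal web UIs (ports 9026, 9035, and 7000) from a local browser, one option is a dynamic SSH tunnel to the master and a browser SOCKS proxy pointed at localhost; the local port 8157 below is an arbitrary choice:

you@localhost: ssh -i ~/EC2_KEY.pem -N -D 8157 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com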