troubleshooting-diff
diff --git a/presentations/2017-support-enablement-training/magnum-intro/handson/troubleshooting.md b/presentations/2017-support-enablement-training/magnum-intro/handson/troubleshooting.md
index 77b027e..363e4c7 100644
--- a/presentations/2017-support-enablement-training/magnum-intro/handson/troubleshooting.md
+++ b/presentations/2017-support-enablement-training/magnum-intro/handson/troubleshooting.md
@@ -1,5 +1,35 @@
# Troubleshooting
+## Preparation: Some Sabotage
+
+* Continue allowing the controller access to the Heat API:
+
+```
+iptables -A INPUT -s $(ip addr sh br-public | grep -w inet | awk '{print $2}' | sed 's#/.*##') -p tcp --dport 8004 -j ACCEPT
+```
+
+* Bar the floating IP subnet from access to the Heat API:
+
+```
+iptables -A INPUT -s $(neutron net-list | grep floating | awk '{print $7}') -p tcp --dport 8004 -j REJECT --reject-with tcp-reset
+```
+
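The address extraction in those two rules can be sanity-checked without touching iptables. A minimal sketch against canned `ip addr` output (bridge name and addresses are fabricated in the style of the example environment; yours will differ):

```shell
# Hypothetical output of `ip addr sh br-public`; only the IPv4 line matters.
sample='4: br-public: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
    inet 192.168.232.2/24 brd 192.168.232.255 scope global br-public
    inet6 fe80::5054:ff:fe77:101/64 scope link'

# Same pipeline as in the ACCEPT rule: `grep -w inet` matches only the IPv4
# line ("inet6" is a different word), field 2 is the CIDR address, and the
# sed expression strips the /prefixlen suffix.
controller_ip=$(echo "$sample" | grep -w inet | awk '{print $2}' | sed 's#/.*##')
echo "$controller_ip"
```

Rule order matters here: iptables evaluates rules top to bottom, so the ACCEPT for the controller's own address has to be appended before the subnet-wide REJECT, presumably because that address falls inside the floating range.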
+## Preparation: Create a Fresh Cluster
+
+* Delete existing cluster
+
+```
+magnum cluster-delete k8s_cluster
+```
+
+* Create a new cluster
+
+```
+ magnum cluster-create --name k8s_cluster \
+ --cluster-template k8s_template \
+ --node-count 1
+```
+
## When is a Cluster Complete?
* Signalling mechanism: Heat WaitCondition
@@ -31,9 +61,11 @@ to get to that point that indicates a successfully deployed instance. If all
of the stack's WaitConditions transition to state CREATE_COMPLETE that way, the
cluster itself is marked complete.
-There are a number of ways this can go wrong. First of all, the instance may be
-unable to reach the Heat API for some reason. We are later going to use the
-blunt instrument of an iptables rule to bring that situation about.
+There are a number of ways this can go wrong, and they are the main failure
+mode for cluster creation (barring problems with resource creation such as the
+ever popular "no valid host was found" from Nova). First of all, the instance
+may be unable to reach the Heat API for some reason. We are later going to use
+the blunt instrument of an iptables rule to bring that situation about.
Second, the instance may simply take too long, causing the wait condition to
time out and thus transition to CREATE_FAILED.
@@ -44,5 +76,131 @@ to reach the point where it curls the Heat API.
-->
-iptables -A INPUT -s $(ip addr sh br-public | grep -w inet | awk '{print $2}' | sed 's#/.*##') -p tcp --dport 8004 -j ACCEPT
-iptables -A INPUT -s $(neutron net-list | grep floating | awk '{print $7}') -p tcp --dport 8004 -j REJECT --reject-with tcp-reset
+## Figuring out the failing WaitCondition
+
+* Step 1: obtain cluster's Heat stack UUID
+
+```
+stack=$(magnum cluster-show k8s_cluster | grep stack_id | awk '{print $4}')
+```
+
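The `awk '{print $4}'` relies on how the table row is whitespace-split: the `|` separators count as fields of their own, so the value sits in field 4. A self-contained check against a fabricated `magnum cluster-show` row (the UUID is made up):

```shell
# Hypothetical stack_id row as printed by `magnum cluster-show k8s_cluster`.
row='| stack_id            | 0b4c0dcf-0f8e-4a31-9ce2-eefb673e0f75 |'
# Whitespace splitting yields: $1="|" $2="stack_id" $3="|" $4=<uuid>
stack=$(echo "$row" | grep stack_id | awk '{print $4}')
echo "$stack"
```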
+* Step 2: find failing WaitCondition
+
+```
+substack=$(openstack stack resource list -n 5 $stack | grep FAILED | grep OS::Heat::WaitCondition | awk '{print $11}')
+waitcondition=$(openstack stack resource list -n 5 $stack | grep FAILED | grep OS::Heat::WaitCondition | awk '{print $2}')
+```
+
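The field numbers in step 2 look odd until you account for the fact that a failed WaitCondition has an empty physical-resource-id column: with whitespace splitting, the resource name lands in field 2 and the substack's stack name in field 11. A sketch against one fabricated row:

```shell
# Hypothetical failed-WaitCondition row from
# `openstack stack resource list -n 5 $stack`; the physical resource id
# column is empty because creation failed, so its cell contributes no field.
row='| master_wait_condition |  | OS::Heat::WaitCondition | CREATE_FAILED | 2017-05-12T09:27:38Z | k8s_cluster-hd7n2lgohdbm-kube_masters-cjcrhdowcrij-0-vrm7vvcvtwyn |'
# $2 is the resource name, $11 the name of the substack containing it.
waitcondition=$(echo "$row" | awk '{print $2}')
substack=$(echo "$row" | awk '{print $11}')
echo "$substack / $waitcondition"
```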
+* Step 3: taking a look
+
+```
+openstack stack resource show $substack $waitcondition
+```
+
+* Sample output:
+
+```
+| kube_masters | 8f51cd8f-cfd7-4d6c-91ee-11ea707366e0 | OS::Heat::ResourceGroup | CREATE_FAILED | 2017-05-12T09:25:37Z | k8s_cluster-hd7n2lgohdbm |
+| 0 | 97325e8c-ecf3-4761-b478-2e1197029b43 | file:///usr/lib/python2.7/site-packages/magnum/drivers/k8s_opensuse_v1/templates/kubemaster.yaml | CREATE_FAILED | 2017-05-12T09:27:27Z | k8s_cluster-hd7n2lgohdbm-kube_masters-cjcrhdowcrij |
+| master_wait_condition | | OS::Heat::WaitCondition | CREATE_FAILED | 2017-05-12T09:27:38Z | k8s_cluster-hd7n2lgohdbm-kube_masters-cjcrhdowcrij-0-vrm7vvcvtwyn |
+
+```
+
+## Checking Inside the Instance
+
+* First port of call: `/var/log/cloud-init-output.log`
+
+```
+#!/bin/bash -v
+curl -i -X POST -H 'X-Auth-Token: 8b4f0181636849e094e41e7a17b7a42b' -H 'Content-Type: application/json' -H 'Accept: application/json' http://192.168.232.2:8004/v1/a3a58e6bd19041aabba1d71209bfd0ac/stacks/k8s_cluster-hd7n2lgohdbm-kube_masters-cjcrhdowcrij-0-vrm7vvcvtwyn/97325e8c-ecf3-4761-b478-2e1197029b43/resources/master_wait_handle/signal --data-binary '{"status": "SUCCESS"}'
+ % Total % Received % Xferd Average Speed Time Time Time Current
+ Dload Upload Total Spent Left Speed
+ ^M 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to 192.168.232.2 port 8004: Connection refused
+ 2017-05-12 09:31:09,517 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-007 [7]
+ 2017-05-12 09:31:09,565 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
+```
+
+* Where does that script come from?
+
+ * Heat templates in `/usr/lib/python2.7/site-packages/magnum/drivers/k8s_opensuse_v1/templates/`
+
+ * Either `kubemaster.yaml` or `kubeminion.yaml`
+
+ * Look for `OS::Heat::MultipartMime` resource
+
+ * Numbering of scripts in `/var/lib/cloud/instances/*/scripts/` matches order of scripts in resource (count starts at 1)
+
+ * Backtrack through `get_resource` and `get_file` to figure out the script file name
+
+```
+cat /var/lib/cloud/instances/a75bc53f-a63e-427a-8c33-6c2f8dd4e9c9/scripts/part-007
+#!/bin/bash -v
+curl -i -X POST -H 'X-Auth-Token: 8b4f0181636849e094e41e7a17b7a42b' -H 'Content-Type: application/json' -H 'Accept: application/json' http://192.168.232.2:8004/v1/a3a58e6bd19041aabba1d71209bfd0ac/stacks/k8s_cluster-hd7n2lgohdbm-kube_masters-cjcrhdowcrij-0-vrm7vvcvtwyn/97325e8c-ecf3-4761-b478-2e1197029b43/resources/master_wait_handle/signal --data-binary '{"status": "SUCCESS"}'
+```
+
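The part-00N numbering can be reproduced mechanically. A sketch with made-up script names (the real names come from the `get_file` calls in the template; only the ordering matters):

```shell
# Hypothetical ordering of scripts in the OS::Heat::MultipartMime resource;
# cloud-init writes them out as part-001, part-002, ... in the same order.
scripts='write-heat-params.sh configure-etcd.sh configure-kubernetes-master.sh'
i=0
for s in $scripts; do
  i=$((i+1))
  printf 'part-%03d -> %s\n' "$i" "$s"
done
```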
+
+## Undo Sabotage
+
+```
+iptables -D INPUT -s $(ip addr sh br-public | grep -w inet | awk '{print $2}' | sed 's#/.*##') -p tcp --dport 8004 -j ACCEPT
+```
+
+```
+iptables -D INPUT -s $(neutron net-list | grep floating | awk '{print $7}') -p tcp --dport 8004 -j REJECT --reject-with tcp-reset
+```
+
+
+## Run curl command manually after removing sabotage
+
+```
+kube-master0:~ # bash /var/lib/cloud/instances/a75bc53f-a63e-427a-8c33-6c2f8dd4e9c9/scripts/part-007
+HTTP/1.1 200 OK
+Content-Type: application/json; charset=UTF-8
+Content-Length: 4
+X-Openstack-Request-Id: req-0d1c4043-c7e0-4ee3-a28f-7dc02a902e48
+Date: Fri, 12 May 2017 11:27:04 GMT
+Connection: close
+
+```
+
+## Recheck wait condition (not really necessary)
+
+```
+root@d52-54-77-77-01-01:~ # openstack stack resource show $substack $waitcondition
++------------------------+------------------------------------------------------------------------------------------------------------------------------------+
+| Field | Value |
++------------------------+------------------------------------------------------------------------------------------------------------------------------------+
+| attributes | {u'data': u'{"1": null}'} |
+| creation_time | 2017-05-12T09:27:38Z |
+| description | |
+| links | [{u'href': u'http://192.168.232.2:8004/v1/9ab8c7e608f049ceb509f28827f46c2d/stacks/k8s_cluster-hd7n2lgohdbm-kube_masters- |
+| | cjcrhdowcrij-0-vrm7vvcvtwyn/97325e8c-ecf3-4761-b478-2e1197029b43/resources/master_wait_condition', u'rel': u'self'}, {u'href': |
+| | u'http://192.168.232.2:8004/v1/9ab8c7e608f049ceb509f28827f46c2d/stacks/k8s_cluster-hd7n2lgohdbm-kube_masters- |
+| | cjcrhdowcrij-0-vrm7vvcvtwyn/97325e8c-ecf3-4761-b478-2e1197029b43', u'rel': u'stack'}] |
+| logical_resource_id | master_wait_condition |
+| parent_resource | 0 |
+| physical_resource_id | |
+| required_by | [] |
+| resource_name | master_wait_condition |
+| resource_status | CREATE_FAILED |
+| resource_status_reason | WaitConditionTimeout: resources.master_wait_condition: 0 of 1 received |
+| resource_type | OS::Heat::WaitCondition |
+| updated_time | 2017-05-12T09:27:38Z |
++------------------------+------------------------------------------------------------------------------------------------------------------------------------+
+```
+
+## Is it a Timeout Issue?
+
+* Step 1: look at links attribute of WaitCondition resource
+
+* Step 2:
+
+```
+substack_uuid=$(heat stack-list | grep $substack | awk '{print $2}')
+grep ${substack_uuid}/resources/$(echo ${waitcondition} | sed s/handle/condition/) /var/log/heat/heat-api.log
+```
+
+* Sample output:
+
+```
+2017-05-12 11:44:05.441 7129 INFO eventlet.wsgi.server [req-3b131f3c-3e13-48b2-8463-d969fdd28c51 eab822445eea4247adf61dbab2b63f2d a3a58e6bd19041aabba1d71209bfd0ac - 1e9dd1b5178e44c68b8e1907e117ed9a 1e9dd1b5178e44c68b8e1907e117ed9a] 192.168.232.2 - - [12/May/2017 11:44:05] "POST /v1/a3a58e6bd19041aabba1d71209bfd0ac/stacks/k8s_cluster-hd7n2lgohdbm-kube_masters-cjcrhdowcrij-0-vrm7vvcvtwyn/97325e8c-ecf3-4761-b478-2e1197029b43/resources/master_wait_handle/signal HTTP/1.1" 200 211 2.03677
+```
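Whether it really was a timeout can be confirmed with a little date arithmetic: compare the WaitCondition's `creation_time` with the timestamp of the signal POST found in heat-api.log. Using the two timestamps from this example, and GNU `date` syntax (the log's local time is assumed to be UTC here):

```shell
# WaitCondition creation_time vs. the time the signal hit heat-api
# (timestamps taken from the example output; assumed to share a timezone).
created='2017-05-12T09:27:38Z'
signalled='2017-05-12T11:44:05Z'
# GNU date: convert both to epoch seconds and take the difference.
elapsed=$(( $(date -u -d "$signalled" +%s) - $(date -u -d "$created" +%s) ))
echo "signal arrived ${elapsed}s after the wait condition was created"
```

If the only signal in the log arrives this long after `creation_time`, well past the wait condition's timeout, the `WaitConditionTimeout` in `resource_status_reason` is consistent: the 200 in the log is just the late signal hitting an already-failed stack.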
+
+## More Sabotage (Proceed At Your Own Peril)