Skip to content

Instantly share code, notes, and snippets.

@elasticdog
Created June 12, 2012 14:45
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save elasticdog/2917954 to your computer and use it in GitHub Desktop.
Save elasticdog/2917954 to your computer and use it in GitHub Desktop.
ocf::heartbeat:IPaddr2 FAILED
Two nodes, alice and bob that are set up to be redundant load balancers using
HAProxy. There are two floating IP address resources that need to be on the
machine for HAProxy to bind to (so only one machine will be running the HAProxy
resource at a time). Both machines are x86_64 CentOS 6.1 running corosync 1.4.1
and pacemaker 1.1.6.
alice.example.com = 172.16.32.10/24 (eth1)
bob.example.com = 172.16.32.11/24 (eth1)
floating-ip-01 = 192.0.2.100/24 (eth1:1)
floating-ip-02 = 192.0.2.200/24 (eth1:2)
Most of the time, migrating the resources between nodes works
flawlessly...every now and then one of the floating-ip resources will fail and
blocks the HAProxy resource from being migrated properly. You can see that the
failcount jumps up to infinity, but running a quick `crm resource cleanup
floating-ip-01` fixes things right up.
I ignore quorum due to there only being two nodes, and I also have no STONITH
just because of lack of availability. I've seen this behavior in our production
setup, and can reproduce it using VirtualBox VMs with this same configuration.
It's a very similar situation to this mailing list thread [1], but the proposed
cause of there being no primary address on the interface does not apply here.
The only error I see is:
"ERROR: RTNETLINK answers: Cannot assign requested address"
...although it's unclear if that's from starting or stopping one of the IPs
(full log below).
[1] http://oss.clusterlabs.org/pipermail/pacemaker/2012-February/013293.html
node alice.example.com \
attributes standby="on"
node bob.example.com \
attributes standby="off"
primitive floating-ip-01 ocf:heartbeat:IPaddr2 \
params ip="192.0.2.100" cidr_netmask="24" nic="eth1" iflabel="1" \
op monitor interval="4s"
primitive floating-ip-02 ocf:heartbeat:IPaddr2 \
params ip="192.0.2.200" cidr_netmask="24" nic="eth1" iflabel="2" \
op monitor interval="4s"
primitive haproxy lsb:haproxy \
op monitor interval="4s"
colocation haproxy-with-ips inf: haproxy floating-ip-01 floating-ip-02
order haproxy-after-ips inf: ( floating-ip-01 floating-ip-02 ) haproxy
property $id="cib-bootstrap-options" \
dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1339449248"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
============
Last updated: Tue Jun 12 14:24:12 2012
Last change: Mon Jun 11 21:54:41 2012 via crm_attribute on alice.example.com
Stack: openais
Current DC: alice.example.com - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
3 Resources configured.
============
Node alice.example.com: standby
Online: [ bob.example.com ]
floating-ip-01 (ocf::heartbeat:IPaddr2): Started alice.example.com (unmanaged) FAILED
Migration summary:
* Node alice.example.com:
floating-ip-01: migration-threshold=1000000 fail-count=1000000
* Node bob.example.com:
Failed actions:
floating-ip-01_stop_0 (node=alice.example.com, call=25, rc=1, status=complete): unknown error
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: - <cib admin_epoch="0" epoch="177" num_updates="31" >
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: - <configuration >
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: - <nodes >
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: - <node id="alice.example.com" >
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: - <instance_attributes id="nodes-alice.example.com" >
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: - <nvpair value="off" id="nodes-alice.example.com-standby" />
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: - </instance_attributes>
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: - </node>
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: - </nodes>
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: - </configuration>
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: - </cib>
Jun 11 19:01:15 alice crmd: [1244]: info: abort_transition_graph: te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.178.1) : Non-status change
Jun 11 19:01:15 alice crmd: [1244]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jun 11 19:01:15 alice crmd: [1244]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Jun 11 19:01:15 alice crmd: [1244]: info: do_pe_invoke: Query 160: Requesting the current CIB: S_POLICY_ENGINE
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: + <cib epoch="178" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.5" update-origin="alice.example.com" update-client="cibadmin" cib-last-written="Mon Jun 11 19:00:37 2012" have-quorum="1" dc-uuid="alice.example.com" >
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: + <configuration >
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: + <nodes >
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: + <node id="alice.example.com" type="normal" uname="alice.example.com" >
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: + <instance_attributes id="nodes-alice.example.com" >
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: + <nvpair id="nodes-alice.example.com-standby" name="standby" value="on" />
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: + </instance_attributes>
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: + </node>
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: + </nodes>
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: + </configuration>
Jun 11 19:01:15 alice cib: [1240]: info: cib:diff: + </cib>
Jun 11 19:01:15 alice cib: [1240]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crm_attribute/4, version=0.178.1): ok (rc=0)
Jun 11 19:01:15 alice pengine: [1243]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 11 19:01:15 alice pengine: [1243]: notice: RecurringOp: Start recurring monitor (4s) for floating-ip-01 on bob.example.com
Jun 11 19:01:15 alice crmd: [1244]: info: do_pe_invoke_callback: Invoking the PE: query=160, ref=pe_calc-dc-1339441275-74, seq=60, quorate=1
Jun 11 19:01:15 alice pengine: [1243]: notice: RecurringOp: Start recurring monitor (4s) for floating-ip-02 on bob.example.com
Jun 11 19:01:15 alice pengine: [1243]: notice: RecurringOp: Start recurring monitor (4s) for haproxy on bob.example.com
Jun 11 19:01:15 alice pengine: [1243]: notice: LogActions: Move floating-ip-01#011(Started alice.example.com -> bob.example.com)
Jun 11 19:01:15 alice pengine: [1243]: notice: LogActions: Move floating-ip-02#011(Started alice.example.com -> bob.example.com)
Jun 11 19:01:15 alice pengine: [1243]: notice: LogActions: Move haproxy#011(Started alice.example.com -> bob.example.com)
Jun 11 19:01:15 alice crmd: [1244]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 11 19:01:15 alice crmd: [1244]: WARN: destroy_action: Cancelling timer for action 1 (src=156)
Jun 11 19:01:15 alice crmd: [1244]: WARN: destroy_action: Cancelling timer for action 4 (src=158)
Jun 11 19:01:15 alice crmd: [1244]: WARN: destroy_action: Cancelling timer for action 3 (src=160)
Jun 11 19:01:15 alice crmd: [1244]: info: unpack_graph: Unpacked transition 11: 10 actions in 10 synapses
Jun 11 19:01:15 alice crmd: [1244]: info: do_te_invoke: Processing graph 11 (ref=pe_calc-dc-1339441275-74) derived from /var/lib/pengine/pe-input-298.bz2
Jun 11 19:01:15 alice crmd: [1244]: info: te_rsc_command: Initiating action 14: stop haproxy_stop_0 on alice.example.com (local)
Jun 11 19:01:15 alice lrmd: [1241]: info: cancel_op: operation monitor[13] on lsb::haproxy::haproxy for client 1244, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_timeout=[20000] CRM_meta_interval=[4000] cancelled
Jun 11 19:01:15 alice crmd: [1244]: info: do_lrm_rsc_op: Performing key=14:11:0:e096d26e-ceee-466c-821f-52f7e8c5d425 op=haproxy_stop_0 )
Jun 11 19:01:15 alice lrmd: [1241]: info: rsc:haproxy:14: stop
Jun 11 19:01:15 alice crmd: [1244]: info: process_lrm_event: LRM operation haproxy_monitor_4000 (call=13, status=1, cib-update=0, confirmed=true) Cancelled
Jun 11 19:01:15 alice lrmd: [19369]: WARN: For LSB init script, no additional parameters are needed.
Jun 11 19:01:15 alice pengine: [1243]: notice: process_pe_message: Transition 11: PEngine Input stored in: /var/lib/pengine/pe-input-298.bz2
Jun 11 19:01:15 alice lrmd: [1241]: info: RA output: (haproxy:stop:stdout) Stopping haproxy:
Jun 11 19:01:16 alice lrmd: [1241]: info: RA output: (haproxy:stop:stdout) #033[60G[#033[0;32m OK #033[0;39m]#015
Jun 11 19:01:16 alice lrmd: [1241]: info: RA output: (haproxy:stop:stdout)
Jun 11 19:01:16 alice crmd: [1244]: info: process_lrm_event: LRM operation haproxy_stop_0 (call=14, rc=0, cib-update=161, confirmed=true) ok
Jun 11 19:01:16 alice crmd: [1244]: info: match_graph_event: Action haproxy_stop_0 (14) confirmed on alice.example.com (rc=0)
Jun 11 19:01:16 alice crmd: [1244]: info: te_rsc_command: Initiating action 8: stop floating-ip-01_stop_0 on alice.example.com (local)
Jun 11 19:01:16 alice lrmd: [1241]: info: cancel_op: operation monitor[11] on ocf::IPaddr2::floating-ip-01 for client 1244, its parameters: cidr_netmask=[24] nic=[eth1] crm_feature_set=[3.0.5] ip=[192.0.2.100] iflabel=[1] CRM_meta_name=[monitor] CRM_meta_timeout=[20000] CRM_meta_interval=[4000] cancelled
Jun 11 19:01:16 alice crmd: [1244]: info: do_lrm_rsc_op: Performing key=8:11:0:e096d26e-ceee-466c-821f-52f7e8c5d425 op=floating-ip-01_stop_0 )
Jun 11 19:01:16 alice lrmd: [1241]: info: rsc:floating-ip-01:15: stop
Jun 11 19:01:16 alice crmd: [1244]: info: te_rsc_command: Initiating action 11: stop floating-ip-02_stop_0 on alice.example.com (local)
Jun 11 19:01:16 alice lrmd: [1241]: info: cancel_op: operation monitor[12] on ocf::IPaddr2::floating-ip-02 for client 1244, its parameters: cidr_netmask=[24] nic=[eth1] crm_feature_set=[3.0.5] ip=[192.0.2.200] iflabel=[2] CRM_meta_name=[monitor] CRM_meta_timeout=[20000] CRM_meta_interval=[4000] cancelled
Jun 11 19:01:16 alice crmd: [1244]: info: do_lrm_rsc_op: Performing key=11:11:0:e096d26e-ceee-466c-821f-52f7e8c5d425 op=floating-ip-02_stop_0 )
Jun 11 19:01:16 alice lrmd: [1241]: info: rsc:floating-ip-02:16: stop
Jun 11 19:01:16 alice crmd: [1244]: info: process_lrm_event: LRM operation floating-ip-01_monitor_4000 (call=11, status=1, cib-update=0, confirmed=true) Cancelled
Jun 11 19:01:16 alice crmd: [1244]: info: process_lrm_event: LRM operation floating-ip-02_monitor_4000 (call=12, status=1, cib-update=0, confirmed=true) Cancelled
Jun 11 19:01:16 alice IPaddr2[19376]: INFO: IP status = ok, IP_CIP=
Jun 11 19:01:16 alice IPaddr2[19375]: INFO: IP status = ok, IP_CIP=
Jun 11 19:01:16 alice crmd: [1244]: info: process_lrm_event: LRM operation floating-ip-01_stop_0 (call=15, rc=0, cib-update=162, confirmed=true) ok
Jun 11 19:01:16 alice crmd: [1244]: info: match_graph_event: Action floating-ip-01_stop_0 (8) confirmed on alice.example.com (rc=0)
Jun 11 19:01:16 alice crmd: [1244]: info: te_rsc_command: Initiating action 9: start floating-ip-01_start_0 on bob.example.com
Jun 11 19:01:16 alice IPaddr2[19376]: ERROR: RTNETLINK answers: Cannot assign requested address
Jun 11 19:01:16 alice crmd: [1244]: info: process_lrm_event: LRM operation floating-ip-02_stop_0 (call=16, rc=1, cib-update=163, confirmed=true) unknown error
Jun 11 19:01:16 alice crmd: [1244]: WARN: status_from_rc: Action 11 (floating-ip-02_stop_0) on alice.example.com failed (target: 0 vs. rc: 1): Error
Jun 11 19:01:16 alice crmd: [1244]: WARN: update_failcount: Updating failcount for floating-ip-02 on alice.example.com after failed stop: rc=1 (update=INFINITY, time=1339441276)
Jun 11 19:01:16 alice attrd: [1242]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-floating-ip-02 (INFINITY)
Jun 11 19:01:16 alice attrd: [1242]: notice: attrd_perform_update: Sent update 33: fail-count-floating-ip-02=INFINITY
Jun 11 19:01:16 alice crmd: [1244]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=floating-ip-02_last_failure_0, magic=0:1;11:11:0:e096d26e-ceee-466c-821f-52f7e8c5d425, cib=0.178.4) : Event failed
Jun 11 19:01:16 alice attrd: [1242]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-floating-ip-02 (1339441276)
Jun 11 19:01:16 alice crmd: [1244]: info: update_abort_priority: Abort priority upgraded from 0 to 1
Jun 11 19:01:16 alice crmd: [1244]: info: update_abort_priority: Abort action done superceeded by restart
Jun 11 19:01:16 alice crmd: [1244]: info: match_graph_event: Action floating-ip-02_stop_0 (11) confirmed on alice.example.com (rc=4)
Jun 11 19:01:16 alice crmd: [1244]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=0, tag=nvpair, id=status-alice.example.com-fail-count-floating-ip-02, name=fail-count-floating-ip-02, value=INFINITY, magic=NA, cib=0.178.5) : Transient attribute: update
Jun 11 19:01:16 alice crmd: [1244]: info: update_abort_priority: Abort priority upgraded from 1 to 1000000
Jun 11 19:01:16 alice crmd: [1244]: info: update_abort_priority: 'Event failed' abort superceeded
Jun 11 19:01:16 alice attrd: [1242]: notice: attrd_perform_update: Sent update 36: last-failure-floating-ip-02=1339441276
Jun 11 19:01:16 alice crmd: [1244]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=0, tag=nvpair, id=status-alice.example.com-last-failure-floating-ip-02, name=last-failure-floating-ip-02, value=1339441276, magic=NA, cib=0.178.6) : Transient attribute: update
Jun 11 19:01:16 alice crmd: [1244]: info: match_graph_event: Action floating-ip-01_start_0 (9) confirmed on bob.example.com (rc=0)
Jun 11 19:01:16 alice crmd: [1244]: info: run_graph: ====================================================
Jun 11 19:01:16 alice crmd: [1244]: notice: run_graph: Transition 11 (Complete=4, Pending=0, Fired=0, Skipped=6, Incomplete=0, Source=/var/lib/pengine/pe-input-298.bz2): Stopped
Jun 11 19:01:16 alice crmd: [1244]: info: te_graph_trigger: Transition 11 is now complete
Jun 11 19:01:16 alice crmd: [1244]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 11 19:01:16 alice crmd: [1244]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Jun 11 19:01:16 alice crmd: [1244]: info: do_pe_invoke: Query 164: Requesting the current CIB: S_POLICY_ENGINE
Jun 11 19:01:16 alice pengine: [1243]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 11 19:01:16 alice pengine: [1243]: WARN: unpack_rsc_op: Processing failed op floating-ip-02_last_failure_0 on alice.example.com: unknown error (1)
Jun 11 19:01:16 alice pengine: [1243]: WARN: common_apply_stickiness: Forcing floating-ip-02 away from alice.example.com after 1000000 failures (max=1000000)
Jun 11 19:01:16 alice pengine: [1243]: notice: RecurringOp: Start recurring monitor (4s) for floating-ip-01 on bob.example.com
Jun 11 19:01:16 alice pengine: [1243]: notice: RecurringOp: Start recurring monitor (4s) for haproxy on bob.example.com
Jun 11 19:01:16 alice pengine: [1243]: notice: LogActions: Leave floating-ip-01#011(Started bob.example.com)
Jun 11 19:01:16 alice pengine: [1243]: notice: LogActions: Leave floating-ip-02#011(Started unmanaged)
Jun 11 19:01:16 alice pengine: [1243]: notice: LogActions: Start haproxy#011(bob.example.com)
Jun 11 19:01:16 alice crmd: [1244]: info: do_pe_invoke_callback: Invoking the PE: query=164, ref=pe_calc-dc-1339441276-79, seq=60, quorate=1
Jun 11 19:01:16 alice crmd: [1244]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 11 19:01:16 alice crmd: [1244]: info: unpack_graph: Unpacked transition 12: 1 actions in 1 synapses
Jun 11 19:01:16 alice crmd: [1244]: info: do_te_invoke: Processing graph 12 (ref=pe_calc-dc-1339441276-79) derived from /var/lib/pengine/pe-input-299.bz2
Jun 11 19:01:16 alice crmd: [1244]: info: te_rsc_command: Initiating action 7: monitor floating-ip-01_monitor_4000 on bob.example.com
Jun 11 19:01:16 alice pengine: [1243]: notice: process_pe_message: Transition 12: PEngine Input stored in: /var/lib/pengine/pe-input-299.bz2
Jun 11 19:01:16 alice crmd: [1244]: info: match_graph_event: Action floating-ip-01_monitor_4000 (7) confirmed on bob.example.com (rc=0)
Jun 11 19:01:16 alice crmd: [1244]: info: run_graph: ====================================================
Jun 11 19:01:16 alice crmd: [1244]: notice: run_graph: Transition 12 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-299.bz2): Complete
Jun 11 19:01:16 alice crmd: [1244]: info: te_graph_trigger: Transition 12 is now complete
Jun 11 19:01:16 alice crmd: [1244]: info: notify_crmd: Transition 12 status: done - <null>
Jun 11 19:01:16 alice crmd: [1244]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 11 19:01:16 alice crmd: [1244]: info: do_state_transition: Starting PEngine Recheck Timer
Jun 11 19:03:34 alice cib: [1240]: info: cib_stats: Processed 158 operations (3227.00us average, 0% utilization) in the last 10min
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment