Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Mesos notes regarding: Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

Mesos versions 0.20.1 and 0.21.0

Master/Primary log:

==> /var/log/mesos/mesos-master.node1.invalid-user.log.WARNING.20150120-204548.12107 <==
Log file created at: 2015/01/20 20:45:48
Running on machine: node1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0120 20:45:48.025610 12116 master.cpp:1083] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

Slave log:

W0120 20:55:22.160058  9624 slave.cpp:2626] Master disconnected! Waiting for a new master to be elected
I0120 20:55:32.017480  9623 detector.cpp:138] Detected a new leader: (id='33')

Solution:

Fix MESOS_QUORUM size in /etc/default/mesos-master on the master node(s). In this case there was only 1 master so the value needed to be "1".

Then restart the service:

sudo service mesos-master restart

And all will be happy:

==> /var/log/mesos/mesos-master.INFO <==
I0120 20:55:32.012778 12683 network.hpp:424] ZooKeeper group memberships changed
I0120 20:55:32.013967 12683 group.cpp:659] Trying to get '/mesos/log_replicas/0000000032' in ZooKeeper
I0120 20:55:32.014857 12687 detector.cpp:138] Detected a new leader: (id='33')
I0120 20:55:32.015980 12687 group.cpp:659] Trying to get '/mesos/info_0000000033' in ZooKeeper
I0120 20:55:32.017633 12683 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@172.16.1.11:5050 }
I0120 20:55:32.018321 12687 detector.cpp:433] A new leading master (UPID=master@172.16.1.11:5050) is detected
I0120 20:55:32.018723 12687 master.cpp:1263] The newly elected leader is master@172.16.1.11:5050 with id 20150120-205522-184619180-5050-12676
I0120 20:55:32.019014 12687 master.cpp:1276] Elected as the leading master!
I0120 20:55:32.019261 12687 master.cpp:1094] Recovering from registrar
I0120 20:55:32.020227 12687 registrar.cpp:313] Recovering registrar
I0120 20:55:32.025616 12684 log.cpp:656] Attempting to start the writer
I0120 20:55:32.034173 12690 replica.cpp:474] Replica received implicit promise request with proposal 1
I0120 20:55:32.035416 12690 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 807088ns
I0120 20:55:32.035841 12690 replica.cpp:342] Persisted promised to 1
I0120 20:55:32.037963 12690 coordinator.cpp:230] Coordinator attemping to fill missing position
I0120 20:55:32.044366 12688 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2
I0120 20:55:32.045568 12688 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 684451ns
I0120 20:55:32.045905 12688 replica.cpp:676] Persisted action at 0
I0120 20:55:32.054538 12685 replica.cpp:508] Replica received write request for position 0
I0120 20:55:32.055205 12685 leveldb.cpp:438] Reading position from leveldb took 55275ns
I0120 20:55:32.056123 12685 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 594093ns
I0120 20:55:32.056448 12685 replica.cpp:676] Persisted action at 0
I0120 20:55:32.061266 12688 replica.cpp:655] Replica received learned notice for position 0
I0120 20:55:32.062516 12688 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 590888ns
I0120 20:55:32.062744 12688 replica.cpp:676] Persisted action at 0
I0120 20:55:32.063030 12688 replica.cpp:661] Replica learned NOP action at position 0
I0120 20:55:32.064226 12688 log.cpp:672] Writer started with ending position 0
I0120 20:55:32.072695 12689 leveldb.cpp:438] Reading position from leveldb took 59994ns
I0120 20:55:32.083608 12688 registrar.cpp:346] Successfully fetched the registry (0B) in 61.625088ms
I0120 20:55:32.084761 12688 registrar.cpp:445] Applied 1 operations in 84017ns; attempting to update the 'registry'
I0120 20:55:32.090651 12686 log.cpp:680] Attempting to append 118 bytes to the log
I0120 20:55:32.091244 12686 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 1
I0120 20:55:32.092430 12683 replica.cpp:508] Replica received write request for position 1
I0120 20:55:32.093332 12683 leveldb.cpp:343] Persisting action (135 bytes) to leveldb took 578049ns
I0120 20:55:32.093626 12683 replica.cpp:676] Persisted action at 1
I0120 20:55:32.099520 12685 replica.cpp:655] Replica received learned notice for position 1
I0120 20:55:32.100365 12685 leveldb.cpp:343] Persisting action (137 bytes) to leveldb took 659046ns
I0120 20:55:32.100677 12685 replica.cpp:676] Persisted action at 1
I0120 20:55:32.100965 12685 replica.cpp:661] Replica learned APPEND action at position 1
I0120 20:55:32.109513 12690 log.cpp:699] Attempting to truncate the log to 1
I0120 20:55:32.110175 12690 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 2
I0120 20:55:32.111495 12690 replica.cpp:508] Replica received write request for position 2
I0120 20:55:32.111800 12689 registrar.cpp:490] Successfully updated the 'registry' in 24.368128ms
I0120 20:55:32.112483 12689 registrar.cpp:376] Successfully recovered registrar
I0120 20:55:32.113041 12690 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 1.141049ms
I0120 20:55:32.113315 12690 replica.cpp:676] Persisted action at 2
I0120 20:55:32.117593 12685 replica.cpp:655] Replica received learned notice for position 2
I0120 20:55:32.118618 12686 master.cpp:1121] Recovered 0 slaves from the Registry (82B) ; allowing 10mins for slaves to re-register
I0120 20:55:32.118962 12685 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 1.182186ms
I0120 20:55:32.119215 12685 leveldb.cpp:401] Deleting ~1 keys from leveldb took 41224ns
I0120 20:55:32.119474 12685 replica.cpp:676] Persisted action at 2
I0120 20:55:32.119738 12685 replica.cpp:661] Replica learned TRUNCATE action at position 2
I0120 20:55:32.144054 12688 master.cpp:3068] Registering slave at slave(1)@172.16.1.15:5050 (node5) with id 20150120-205522-184619180-5050-12676-S0
I0120 20:55:32.147260 12686 registrar.cpp:445] Applied 1 operations in 698375ns; attempting to update the 'registry'
I0120 20:55:32.151125 12690 log.cpp:680] Attempting to append 282 bytes to the log
I0120 20:55:32.151716 12683 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 3
I0120 20:55:32.153053 12683 replica.cpp:508] Replica received write request for position 3
I0120 20:55:32.154202 12683 leveldb.cpp:343] Persisting action (301 bytes) to leveldb took 713538ns
I0120 20:55:32.154470 12683 replica.cpp:676] Persisted action at 3
I0120 20:55:32.157678 12689 replica.cpp:655] Replica received learned notice for position 3
I0120 20:55:32.158469 12689 leveldb.cpp:343] Persisting action (303 bytes) to leveldb took 680963ns
I0120 20:55:32.158684 12689 replica.cpp:676] Persisted action at 3
I0120 20:55:32.158946 12689 replica.cpp:661] Replica learned APPEND action at position 3
I0120 20:55:32.160116 12689 log.cpp:699] Attempting to truncate the log to 3
I0120 20:55:32.160517 12689 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 4
I0120 20:55:32.161438 12689 replica.cpp:508] Replica received write request for position 4
I0120 20:55:32.162703 12689 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 928859ns
I0120 20:55:32.162976 12689 replica.cpp:676] Persisted action at 4
I0120 20:55:32.170222 12684 replica.cpp:655] Replica received learned notice for position 4
I0120 20:55:32.170754 12688 master.cpp:1383] Received registration request for framework 'marathon-0.7.6' at scheduler-4bb6f980-5272-46f6-84f8-50cdbb01916d@172.16.1.11:39490
I0120 20:55:32.171977 12684 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 879271ns
I0120 20:55:32.172243 12684 leveldb.cpp:401] Deleting ~2 keys from leveldb took 44395ns
I0120 20:55:32.173485 12684 replica.cpp:676] Persisted action at 4
I0120 20:55:32.173954 12684 replica.cpp:661] Replica learned TRUNCATE action at position 4
I0120 20:55:32.171046 12686 registrar.cpp:490] Successfully updated the 'registry' in 23.20896ms
I0120 20:55:32.180392 12689 hierarchical_allocator_process.hpp:442] Added slave 20150120-205522-184619180-5050-12676-S0 (node5) with cpus(*):1; mem(*):496; disk(*):35164; ports(*):[31000-32000] (and cpus(*):1; mem(*):496; disk(*):35164; ports(*):[31000-32000] available)
I0120 20:55:32.181598 12688 master.cpp:3122] Registered slave 20150120-205522-184619180-5050-12676-S0 at slave(1)@172.16.1.15:5050 (node5) with cpus(*):1; mem(*):496; disk(*):35164; ports(*):[31000-32000]
I0120 20:55:32.182005 12688 master.cpp:1383] Received registration request for framework 'marathon-0.7.6' at scheduler-4bb6f980-5272-46f6-84f8-50cdbb01916d@172.16.1.11:39490
I0120 20:55:32.183660 12688 master.cpp:1447] Registering framework 20150120-205522-184619180-5050-12676-0000 (marathon-0.7.6) at scheduler-4bb6f980-5272-46f6-84f8-50cdbb01916d@172.16.1.11:39490
I0120 20:55:32.185972 12690 hierarchical_allocator_process.hpp:329] Added framework 20150120-205522-184619180-5050-12676-0000
I0120 20:55:32.189689 12688 master.cpp:1434] Framework 20150120-205522-184619180-5050-12676-0000 (marathon-0.7.6) at scheduler-4bb6f980-5272-46f6-84f8-50cdbb01916d@172.16.1.11:39490 already registered, resending acknowledgement

Related resources:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.