Skip to content

Instantly share code, notes, and snippets.

@jaytaylor
Last active August 29, 2015 14:13
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jaytaylor/3fced64bcf1fecc61d18 to your computer and use it in GitHub Desktop.
Save jaytaylor/3fced64bcf1fecc61d18 to your computer and use it in GitHub Desktop.
Mesos notes regarding: Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

Mesos versions 0.20.1 and 0.21.0

Master/Primary log:

==> /var/log/mesos/mesos-master.node1.invalid-user.log.WARNING.20150120-204548.12107 <==
Log file created at: 2015/01/20 20:45:48
Running on machine: node1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0120 20:45:48.025610 12116 master.cpp:1083] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

Slave log:

W0120 20:55:22.160058  9624 slave.cpp:2626] Master disconnected! Waiting for a new master to be elected
I0120 20:55:32.017480  9623 detector.cpp:138] Detected a new leader: (id='33')

Solution:

Fix MESOS_QUORUM size in /etc/default/mesos-master on the master node(s). In this case there was only 1 master so the value needed to be "1".

Then restart the service:

sudo service mesos-master restart

And all will be happy:

==> /var/log/mesos/mesos-master.INFO <==
I0120 20:55:32.012778 12683 network.hpp:424] ZooKeeper group memberships changed
I0120 20:55:32.013967 12683 group.cpp:659] Trying to get '/mesos/log_replicas/0000000032' in ZooKeeper
I0120 20:55:32.014857 12687 detector.cpp:138] Detected a new leader: (id='33')
I0120 20:55:32.015980 12687 group.cpp:659] Trying to get '/mesos/info_0000000033' in ZooKeeper
I0120 20:55:32.017633 12683 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@172.16.1.11:5050 }
I0120 20:55:32.018321 12687 detector.cpp:433] A new leading master (UPID=master@172.16.1.11:5050) is detected
I0120 20:55:32.018723 12687 master.cpp:1263] The newly elected leader is master@172.16.1.11:5050 with id 20150120-205522-184619180-5050-12676
I0120 20:55:32.019014 12687 master.cpp:1276] Elected as the leading master!
I0120 20:55:32.019261 12687 master.cpp:1094] Recovering from registrar
I0120 20:55:32.020227 12687 registrar.cpp:313] Recovering registrar
I0120 20:55:32.025616 12684 log.cpp:656] Attempting to start the writer
I0120 20:55:32.034173 12690 replica.cpp:474] Replica received implicit promise request with proposal 1
I0120 20:55:32.035416 12690 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 807088ns
I0120 20:55:32.035841 12690 replica.cpp:342] Persisted promised to 1
I0120 20:55:32.037963 12690 coordinator.cpp:230] Coordinator attemping to fill missing position
I0120 20:55:32.044366 12688 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2
I0120 20:55:32.045568 12688 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 684451ns
I0120 20:55:32.045905 12688 replica.cpp:676] Persisted action at 0
I0120 20:55:32.054538 12685 replica.cpp:508] Replica received write request for position 0
I0120 20:55:32.055205 12685 leveldb.cpp:438] Reading position from leveldb took 55275ns
I0120 20:55:32.056123 12685 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 594093ns
I0120 20:55:32.056448 12685 replica.cpp:676] Persisted action at 0
I0120 20:55:32.061266 12688 replica.cpp:655] Replica received learned notice for position 0
I0120 20:55:32.062516 12688 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 590888ns
I0120 20:55:32.062744 12688 replica.cpp:676] Persisted action at 0
I0120 20:55:32.063030 12688 replica.cpp:661] Replica learned NOP action at position 0
I0120 20:55:32.064226 12688 log.cpp:672] Writer started with ending position 0
I0120 20:55:32.072695 12689 leveldb.cpp:438] Reading position from leveldb took 59994ns
I0120 20:55:32.083608 12688 registrar.cpp:346] Successfully fetched the registry (0B) in 61.625088ms
I0120 20:55:32.084761 12688 registrar.cpp:445] Applied 1 operations in 84017ns; attempting to update the 'registry'
I0120 20:55:32.090651 12686 log.cpp:680] Attempting to append 118 bytes to the log
I0120 20:55:32.091244 12686 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 1
I0120 20:55:32.092430 12683 replica.cpp:508] Replica received write request for position 1
I0120 20:55:32.093332 12683 leveldb.cpp:343] Persisting action (135 bytes) to leveldb took 578049ns
I0120 20:55:32.093626 12683 replica.cpp:676] Persisted action at 1
I0120 20:55:32.099520 12685 replica.cpp:655] Replica received learned notice for position 1
I0120 20:55:32.100365 12685 leveldb.cpp:343] Persisting action (137 bytes) to leveldb took 659046ns
I0120 20:55:32.100677 12685 replica.cpp:676] Persisted action at 1
I0120 20:55:32.100965 12685 replica.cpp:661] Replica learned APPEND action at position 1
I0120 20:55:32.109513 12690 log.cpp:699] Attempting to truncate the log to 1
I0120 20:55:32.110175 12690 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 2
I0120 20:55:32.111495 12690 replica.cpp:508] Replica received write request for position 2
I0120 20:55:32.111800 12689 registrar.cpp:490] Successfully updated the 'registry' in 24.368128ms
I0120 20:55:32.112483 12689 registrar.cpp:376] Successfully recovered registrar
I0120 20:55:32.113041 12690 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 1.141049ms
I0120 20:55:32.113315 12690 replica.cpp:676] Persisted action at 2
I0120 20:55:32.117593 12685 replica.cpp:655] Replica received learned notice for position 2
I0120 20:55:32.118618 12686 master.cpp:1121] Recovered 0 slaves from the Registry (82B) ; allowing 10mins for slaves to re-register
I0120 20:55:32.118962 12685 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 1.182186ms
I0120 20:55:32.119215 12685 leveldb.cpp:401] Deleting ~1 keys from leveldb took 41224ns
I0120 20:55:32.119474 12685 replica.cpp:676] Persisted action at 2
I0120 20:55:32.119738 12685 replica.cpp:661] Replica learned TRUNCATE action at position 2
I0120 20:55:32.144054 12688 master.cpp:3068] Registering slave at slave(1)@172.16.1.15:5050 (node5) with id 20150120-205522-184619180-5050-12676-S0
I0120 20:55:32.147260 12686 registrar.cpp:445] Applied 1 operations in 698375ns; attempting to update the 'registry'
I0120 20:55:32.151125 12690 log.cpp:680] Attempting to append 282 bytes to the log
I0120 20:55:32.151716 12683 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 3
I0120 20:55:32.153053 12683 replica.cpp:508] Replica received write request for position 3
I0120 20:55:32.154202 12683 leveldb.cpp:343] Persisting action (301 bytes) to leveldb took 713538ns
I0120 20:55:32.154470 12683 replica.cpp:676] Persisted action at 3
I0120 20:55:32.157678 12689 replica.cpp:655] Replica received learned notice for position 3
I0120 20:55:32.158469 12689 leveldb.cpp:343] Persisting action (303 bytes) to leveldb took 680963ns
I0120 20:55:32.158684 12689 replica.cpp:676] Persisted action at 3
I0120 20:55:32.158946 12689 replica.cpp:661] Replica learned APPEND action at position 3
I0120 20:55:32.160116 12689 log.cpp:699] Attempting to truncate the log to 3
I0120 20:55:32.160517 12689 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 4
I0120 20:55:32.161438 12689 replica.cpp:508] Replica received write request for position 4
I0120 20:55:32.162703 12689 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 928859ns
I0120 20:55:32.162976 12689 replica.cpp:676] Persisted action at 4
I0120 20:55:32.170222 12684 replica.cpp:655] Replica received learned notice for position 4
I0120 20:55:32.170754 12688 master.cpp:1383] Received registration request for framework 'marathon-0.7.6' at scheduler-4bb6f980-5272-46f6-84f8-50cdbb01916d@172.16.1.11:39490
I0120 20:55:32.171977 12684 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 879271ns
I0120 20:55:32.172243 12684 leveldb.cpp:401] Deleting ~2 keys from leveldb took 44395ns
I0120 20:55:32.173485 12684 replica.cpp:676] Persisted action at 4
I0120 20:55:32.173954 12684 replica.cpp:661] Replica learned TRUNCATE action at position 4
I0120 20:55:32.171046 12686 registrar.cpp:490] Successfully updated the 'registry' in 23.20896ms
I0120 20:55:32.180392 12689 hierarchical_allocator_process.hpp:442] Added slave 20150120-205522-184619180-5050-12676-S0 (node5) with cpus(*):1; mem(*):496; disk(*):35164; ports(*):[31000-32000] (and cpus(*):1; mem(*):496; disk(*):35164; ports(*):[31000-32000] available)
I0120 20:55:32.181598 12688 master.cpp:3122] Registered slave 20150120-205522-184619180-5050-12676-S0 at slave(1)@172.16.1.15:5050 (node5) with cpus(*):1; mem(*):496; disk(*):35164; ports(*):[31000-32000]
I0120 20:55:32.182005 12688 master.cpp:1383] Received registration request for framework 'marathon-0.7.6' at scheduler-4bb6f980-5272-46f6-84f8-50cdbb01916d@172.16.1.11:39490
I0120 20:55:32.183660 12688 master.cpp:1447] Registering framework 20150120-205522-184619180-5050-12676-0000 (marathon-0.7.6) at scheduler-4bb6f980-5272-46f6-84f8-50cdbb01916d@172.16.1.11:39490
I0120 20:55:32.185972 12690 hierarchical_allocator_process.hpp:329] Added framework 20150120-205522-184619180-5050-12676-0000
I0120 20:55:32.189689 12688 master.cpp:1434] Framework 20150120-205522-184619180-5050-12676-0000 (marathon-0.7.6) at scheduler-4bb6f980-5272-46f6-84f8-50cdbb01916d@172.16.1.11:39490 already registered, resending acknowledgement

Related resources:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment