We have a node nifi cluster running with 3 zookeeper instances (replicated) in a Docker Swarm Cluster.
Most of time the cluster is operating fine, but from time to time we notice that Nifi stops processing messages completely. It eventually resumes after a while (sometimes after a couple of seconds, sometimes after a couple of minutes).
When I do a grep o.a.n.c.l.e.CuratorLeaderElectionManager /srv/nifi/logs/nifi-app.log on the primary node, I see a lof of suspended / reconnected messages.
2017-06-13 21:29:35,622 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@778f44c6 Connection State changed to SUSPENDED
2017-06-13 21:29:35,623 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@3485c1a5 Connection State changed to SUSPENDED
2017-06-13 21:29:35,628 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@3485c1a5 has been interrupted; no longer leader for role 'Cluster Coordinator'
2017-06-13 21:29:35,628 INFO [Leader Election Notification Thread-4] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@778f44c6 has been interrupted; no longer leader for role 'Primary Node'
2017-06-13 21:29:35,629 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@3485c1a5 This node is no longer leader for role 'Cluster Coordinator'
2017-06-13 21:29:35,629 INFO [Leader Election Notification Thread-4] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@778f44c6 This node is no longer leader for role 'Primary Node'
2017-06-13 21:29:38,154 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@778f44c6 Connection State changed to RECONNECTED
2017-06-13 21:29:38,155 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@3485c1a5 Connection State changed to RECONNECTED
2017-06-13 21:29:38,999 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@3485c1a5 This node has been elected Leader for Role 'Cluster Coordinator'
2017-06-13 21:29:39,000 INFO [Leader Election Notification Thread-4] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@778f44c6 This node has been elected Leader for Role 'Primary Node'
2017-06-13 21:36:38,618 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@778f44c6 Connection State changed to SUSPENDED
2017-06-13 21:36:38,618 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@3485c1a5 Connection State changed to SUSPENDED
2017-06-13 21:36:38,626 INFO [Leader Election Notification Thread-4] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@778f44c6 has been interrupted; no longer leader for role 'Primary Node'
2017-06-13 21:36:38,626 INFO [Leader Election Notification Thread-4] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@778f44c6 This node is no longer leader for role 'Primary Node'
2017-06-13 21:36:38,626 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@3485c1a5 has been interrupted; no longer leader for role 'Cluster Coordinator'
2017-06-13 21:36:38,626 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@3485c1a5 This node is no longer leader for role 'Cluster Coordinator'
2017-06-13 21:36:39,920 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@778f44c6 Connection State changed to RECONNECTED
2017-06-13 21:36:39,920 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@3485c1a5 Connection State changed to RECONNECTED
2017-06-13 21:36:42,879 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@3485c1a5 This node has been elected Leader for Role 'Cluster Coordinator'
2017-06-13 21:36:42,903 INFO [Leader Election Notification Thread-4] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@778f44c6 This node has been elected Leader for Role 'Primary Node'
Likewise on the other node, I see similar messages
2017-06-13 21:36:21,846 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@251ef4ed Connection State changed to SUSPENDED
2017-06-13 21:36:22,183 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@751a16d6 Connection State changed to SUSPENDED
2017-06-13 21:36:36,203 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@251ef4ed Connection State changed to RECONNECTED
2017-06-13 21:36:36,203 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@751a16d6 Connection State changed to RECONNECTED
2017-06-13 21:36:38,974 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@251ef4ed Connection State changed to SUSPENDED
2017-06-13 21:36:39,022 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@751a16d6 Connection State changed to SUSPENDED
2017-06-13 21:36:39,089 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@251ef4ed Connection State changed to LOST
2017-06-13 21:36:39,092 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@751a16d6 Connection State changed to LOST
2017-06-13 21:36:40,327 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@251ef4ed Connection State changed to RECONNECTED
2017-06-13 21:36:40,327 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@751a16d6 Connection State changed to RECONNECTED
2017-06-13 21:37:07,170 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@251ef4ed Connection State changed to SUSPENDED
2017-06-13 21:37:07,170 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@751a16d6 Connection State changed to SUSPENDED
2017-06-13 21:37:08,026 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@251ef4ed Connection State changed to RECONNECTED
2017-06-13 21:37:08,027 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@751a16d6 Connection State changed to RECONNECTED
2017-06-13 21:41:40,436 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@251ef4ed Connection State changed to SUSPENDED
2017-06-13 21:41:40,585 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@751a16d6 Connection State changed to SUSPENDED
2017-06-13 21:41:42,151 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@251ef4ed Connection State changed to LOST
2017-06-13 21:41:42,151 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@751a16d6 Connection State changed to LOST
2017-06-13 21:41:43,278 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@251ef4ed Connection State changed to RECONNECTED
2017-06-13 21:41:43,278 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@751a16d6 Connection State changed to RECONNECTED
The only real exceptions I'm seeing in the logs are these
2017-06-13 21:03:26,829 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I also this on the UI from time to time :
com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Read timed out
Is there anything I can do to further debug this ? Is it normal to see that many connection state changes ? (the logs are full of them). The solution is running on 3 VMs, using Docker Swarm. Nifi is running on 2 of those 3 VMs. We have a zookeeper setup running on all 3 VMs.
I don't see any errors in the zookeeper logs.