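My Hadoop cluster went down several times recently. At first I had no idea why. Then I found that my ZooKeeper cluster was the problem, and finally that the configuration in my hosts file had been changed at some point without my noticing. It was only by reading the logs that I found the key to the problem.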

1. A common and misleading ZooKeeper WARN:

When you start ZooKeeper on node1 only and then run the following command:

$ zkServer.sh status

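it reports that the server is not running normally.

This is not actually an error (the log entries are marked WARN). Looking at the log, you will find a large number of exceptions like the ones below: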

[Screenshot: ZooKeeper log showing the repeated WARN exceptions]

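At first I naively assumed that passwordless (key-based) SSH login had broken. Only later did I realize that it had not.

The reason for these warnings is that when ZooKeeper starts, it tries by default to connect to the other cluster machines configured in zoo.cfg, and it logs a WARN for every machine it cannot reach. Because only one of the three machines had been started, fewer than a majority of the configured servers were up, so no leader could be elected and status reported that the server was not running normally. (An odd number of servers is the usual recommendation, but what actually matters is that more than half of the configured servers are running.)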
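The majority rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not ZooKeeper's own code; the function names `quorum_size` and `has_quorum` are made up for this example.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, running: int) -> bool:
    """True if the running servers are more than half of the configured ones."""
    return running >= quorum_size(ensemble_size)


# Three servers are configured in zoo.cfg (node1, node2, node3).
for running in range(1, 4):
    state = "quorum" if has_quorum(3, running) else "no quorum"
    print(f"{running} of 3 running -> {state}")
```

With three configured servers, one running server is not enough, while two or three are. This also shows why a fourth server adds no fault tolerance: `quorum_size(4)` is 3, so a four-node cluster still survives only one failure.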

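Once all three nodes have started successfully, checking status tells you which machine is a follower and which is the leader, and the log on the last node to start shows no errors or warnings at all, as in the log below: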

2017-02-08 13:12:21,497 [myid:] - INFO  [main:QuorumPeerConfig@124] - Reading configuration from: /mnt/modules/zookeeper-3.4.9/bin/../conf/zoo.cfg
2017-02-08 13:12:21,643 [myid:] - INFO  [main:QuorumPeer$QuorumServer@149] - Resolved hostname: node1 to address: node1/192.168.124.131
2017-02-08 13:12:21,647 [myid:] - INFO  [main:QuorumPeer$QuorumServer@149] - Resolved hostname: node3 to address: node3/192.168.124.133
2017-02-08 13:12:21,649 [myid:] - INFO  [main:QuorumPeer$QuorumServer@149] - Resolved hostname: node2 to address: node2/192.168.124.132
2017-02-08 13:12:21,649 [myid:] - INFO  [main:QuorumPeerConfig@352] - Defaulting to majority quorums
2017-02-08 13:12:21,683 [myid:3] - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2017-02-08 13:12:21,689 [myid:3] - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 0
2017-02-08 13:12:21,689 [myid:3] - INFO  [main:DatadirCleanupManager@101] - Purge task is not scheduled.
2017-02-08 13:12:21,724 [myid:3] - INFO  [main:QuorumPeerMain@127] - Starting quorum peer
2017-02-08 13:12:21,920 [myid:3] - INFO  [main:NIOServerCnxnFactory@89] - binding to port 0.0.0.0/0.0.0.0:2181
2017-02-08 13:12:21,958 [myid:3] - INFO  [main:QuorumPeer@1019] - tickTime set to 2000
2017-02-08 13:12:21,961 [myid:3] - INFO  [main:QuorumPeer@1039] - minSessionTimeout set to -1
2017-02-08 13:12:21,962 [myid:3] - INFO  [main:QuorumPeer@1050] - maxSessionTimeout set to -1
2017-02-08 13:12:21,962 [myid:3] - INFO  [main:QuorumPeer@1065] - initLimit set to 10
2017-02-08 13:12:21,990 [myid:3] - INFO  [main:QuorumPeer@533] - currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation
2017-02-08 13:12:22,002 [myid:3] - INFO  [main:QuorumPeer@548] - acceptedEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation
2017-02-08 13:12:22,012 [myid:3] - INFO  [ListenerThread:QuorumCnxManager$Listener@534] - My election bind port: node3/192.168.124.133:3888
2017-02-08 13:12:22,024 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
2017-02-08 13:12:22,025 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id =  3, proposed zxid=0x0
2017-02-08 13:12:22,043 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x0 (n.peerEpoch) LOOKING (my state)
2017-02-08 13:12:22,044 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), FOLLOWING (n.state), 1 (n.sid), 0x1 (n.peerEpoch) LOOKING (my state)
2017-02-08 13:12:22,044 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x0 (n.peerEpoch) LOOKING (my state)
2017-02-08 13:12:22,045 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEpoch) LOOKING (my state)
2017-02-08 13:12:22,045 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LEADING (n.state), 2 (n.sid), 0x1 (n.peerEpoch) LOOKING (my state)
2017-02-08 13:12:22,046 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
2017-02-08 13:12:22,058 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Learner@86] - TCP NoDelay set to: true
2017-02-08 13:12:22,078 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:zookeeper.version=3.4.9-1757313, built on 08/23/2016 06:50 GMT
2017-02-08 13:12:22,078 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:host.name=node3
2017-02-08 13:12:22,078 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.version=1.7.0_79
2017-02-08 13:12:22,078 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.vendor=Oracle Corporation
2017-02-08 13:12:22,078 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.home=/mnt/software/jdk1.7.0_79/jre
2017-02-08 13:12:22,078 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.class.path=/mnt/modules/zookeeper-3.4.9/bin/../build/classes:/mnt/modules/zookeeper-3.4.9/bin/../build/lib/*.jar:/mnt/modules/zookeeper-3.4.9/bin/../lib/slf4j-log4j12-1.6.1.jar:/mnt/modules/zookeeper-3.4.9/bin/../lib/slf4j-api-1.6.1.jar:/mnt/modules/zookeeper-3.4.9/bin/../lib/netty-3.10.5.Final.jar:/mnt/modules/zookeeper-3.4.9/bin/../lib/log4j-1.2.16.jar:/mnt/modules/zookeeper-3.4.9/bin/../lib/jline-0.9.94.jar:/mnt/modules/zookeeper-3.4.9/bin/../zookeeper-3.4.9.jar:/mnt/modules/zookeeper-3.4.9/bin/../src/java/lib/*.jar:/mnt/modules/zookeeper-3.4.9/bin/../conf:
2017-02-08 13:12:22,079 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2017-02-08 13:12:22,079 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.io.tmpdir=/tmp
2017-02-08 13:12:22,079 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.compiler=<NA>
2017-02-08 13:12:22,079 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:os.name=Linux
2017-02-08 13:12:22,079 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:os.arch=amd64
2017-02-08 13:12:22,079 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:os.version=2.6.32-358.el6.x86_64
2017-02-08 13:12:22,079 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:user.name=root
2017-02-08 13:12:22,079 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:user.home=/root
2017-02-08 13:12:22,079 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:user.dir=/root/Desktop
2017-02-08 13:12:22,081 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/zookeeper/version-2 snapdir /opt/zookeeper/version-2
2017-02-08 13:12:22,081 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@61] - FOLLOWING - LEADER ELECTION TOOK - 56
2017-02-08 13:12:22,084 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] - Resolved hostname: node2 to address: node2/192.168.124.132
2017-02-08 13:12:22,089 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Learner@329] - Getting a snapshot from leader
2017-02-08 13:12:22,093 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240] - Snapshotting: 0x100000000 to /opt/zookeeper/version-2/snapshot.100000000
2017-02-08 13:12:39,613 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:42881
2017-02-08 13:12:39,621 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing srvr command from /127.0.0.1:42881
2017-02-08 13:12:39,626 [myid:3] - INFO  [Thread-1:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:42881 (no session established for client)
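The last three log lines show a `srvr` command arriving from 127.0.0.1, which appears to be the status check itself. Below is a minimal sketch of asking each node for its role in the same way from Python. It assumes the four-letter command `srvr` is accepted on client port 2181 and that the reply contains a `Mode:` line. The helper names `parse_mode` and `query_mode` are made up for this example.

```python
import socket
from typing import Optional


def parse_mode(srvr_reply: str) -> Optional[str]:
    """Return the value of the 'Mode:' line in a srvr reply, or None if absent."""
    for line in srvr_reply.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Mode":
            return value.strip()
    return None


def query_mode(host: str, port: int = 2181, timeout: float = 3.0) -> Optional[str]:
    """Send 'srvr' to one server and return its reported mode.

    Raises OSError if the host cannot be resolved or reached, which is
    also what a wrong hosts file entry would typically cause.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return parse_mode(b"".join(chunks).decode("utf-8", errors="replace"))
```

Calling `query_mode("node1")`, `query_mode("node2")` and `query_mode("node3")` in turn gives a quick overview of the cluster. A result of `None` means the server answered but reported no mode, which would be consistent with a node that has not joined a quorum.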

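2. Reconfiguring is error-prone:

Starting the cluster again after changing its configuration easily runs into problems. On every node, delete everything in the configured data directory except the myid file! Note that this wipes the stored snapshots and transaction logs, so only do it on a cluster whose data you can afford to lose.

The most annoying part: it is also best to restart once afterwards! Take this as a word of advice.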
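The cleanup step above could be scripted roughly as follows. This is a sketch under assumptions: the function name `clean_data_dir` is made up, and the data directory path must be taken from the `dataDir` setting in your own zoo.cfg (in the log above it appears to be /opt/zookeeper). It defaults to a dry run because the deletion cannot be undone.

```python
import shutil
from pathlib import Path


def clean_data_dir(data_dir: str, keep: str = "myid", dry_run: bool = True) -> list:
    """Remove everything directly under data_dir except the entry named `keep`.

    Returns the sorted names of the entries that were removed, or that
    would be removed when dry_run is True.
    """
    root = Path(data_dir)
    removed = []
    for entry in sorted(root.iterdir()):
        if entry.name == keep:
            continue  # never touch the myid file
        removed.append(entry.name)
        if dry_run:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # e.g. the version-2 directory
        else:
            entry.unlink()
    return removed
```

Run it first as `clean_data_dir("/opt/zookeeper")` to see what would go, and only then with `dry_run=False`, with ZooKeeper stopped on that node.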

3. A well-written article on ZooKeeper configuration that I recommend:

http://blog.csdn.net/shirdrn/article/details/7183503