Version 14 (modified by mitty, 8 years ago)

--

High Availability

tutorial

DRBD

  • Distributed Replicated Block Device

Pacemaker

corosync

interface FAULTY

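  • In a redundant configuration with multiple NICs, when one interface goes down, its problem counter starts counting up toward the threshold (10 by default). Once the counter reaches the threshold, the interface is marked FAULTY and is no longer used.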
    Jul  1 18:10:00 debian-hab corosync[1377]:   [TOTEM ] Incrementing problem counter for seqid 850 iface 172.16.0.209 to [1 of 10]
    Jul  1 18:10:00 debian-hab corosync[1377]:   [TOTEM ] Incrementing problem counter for seqid 852 iface 172.16.0.209 to [2 of 10]
    
    (snip)
    
    Jul  1 18:10:11 debian-hab corosync[1377]:   [TOTEM ] Incrementing problem counter for seqid 876 iface 172.16.0.209 to [9 of 10]
    Jul  1 18:10:11 debian-hab corosync[1377]:   [TOTEM ] Incrementing problem counter for seqid 878 iface 172.16.0.209 to [10 of 10]
    Jul  1 18:10:11 debian-hab corosync[1377]:   [TOTEM ] Marking seqid 878 ringid 1 interface 172.16.0.209 FAULTY - adminisrtative intervention required.
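
The FAULTY marking shown above is easy to miss in a busy syslog. As a minimal sketch (not part of corosync itself), the following Python scans log lines for it. The regular expression is based only on the message format seen in the log excerpt above, and the function name `find_faulty` is a hypothetical helper; other corosync versions may word the message differently.

```python
import re

# Pattern derived from the "Marking ... FAULTY" line in the excerpt above.
# Assumption: the message format is the same on the system being checked.
FAULTY_RE = re.compile(
    r"\[TOTEM \] Marking seqid (?P<seqid>\d+) ringid (?P<ring>\d+) "
    r"interface (?P<iface>\S+) FAULTY"
)


def find_faulty(lines):
    """Return (ring id, interface address) for every FAULTY marking found."""
    found = []
    for line in lines:
        m = FAULTY_RE.search(line)
        if m:
            found.append((int(m.group("ring")), m.group("iface")))
    return found


# Two lines copied from the log excerpt above.
sample = [
    "Jul  1 18:10:11 debian-hab corosync[1377]:   [TOTEM ] Incrementing problem counter for seqid 878 iface 172.16.0.209 to [10 of 10]",
    "Jul  1 18:10:11 debian-hab corosync[1377]:   [TOTEM ] Marking seqid 878 ringid 1 interface 172.16.0.209 FAULTY - adminisrtative intervention required.",
]

print(find_faulty(sample))  # [(1, '172.16.0.209')]
```

In practice the lines would come from the syslog file or a log pipeline rather than a hard-coded list.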
    
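  • Even if the link comes back after the interface has been marked FAULTY, the ring must be re-enabled manually with corosync-cfgtool -r. (Rebooting the node does not clear the state. Rebooting all nodes at staggered times sometimes does, but the timing required is unclear.)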
    • mitty@debian-hab:~$ sudo corosync-cfgtool -s
      Printing ring status.
      Local node ID 1358997696
      RING ID 0
              id      = 192.168.0.209
              status  = ring 0 active with no faults
      RING ID 1
              id      = 172.16.0.209
              status  = Marking seqid 24 ringid 1 interface 172.16.0.209 FAULTY - adminisrtative intervention required.
      
    • mitty@debian-hab:~$ sudo corosync-cfgtool -r
      Re-enabling all failed rings.
      
    • mitty@debian-hab:~$ sudo corosync-cfgtool -s
      Printing ring status.
      Local node ID 1358997696
      RING ID 0
              id      = 192.168.0.209
              status  = ring 0 active with no faults
      RING ID 1
              id      = 172.16.0.209
              status  = ring 1 active with no faults
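
The check-then-re-enable sequence above could be scripted roughly as follows. This is only a sketch: the parser assumes the `corosync-cfgtool -s` output layout shown above, `faulty_rings` and `reenable_if_faulty` are hypothetical helper names, and the second function needs root privileges and a running corosync, so it is defined but not called here.

```python
import re
import subprocess


def faulty_rings(status_text):
    """Return the ring IDs whose status line contains FAULTY."""
    faulty = []
    ring = None
    for raw in status_text.splitlines():
        line = raw.strip()
        m = re.match(r"RING ID (\d+)$", line)
        if m:
            ring = int(m.group(1))  # remember which ring the next lines belong to
        elif line.startswith("status") and "FAULTY" in line and ring is not None:
            faulty.append(ring)
    return faulty


def reenable_if_faulty():
    """Run 'corosync-cfgtool -s' and, if any ring is FAULTY, 'corosync-cfgtool -r'."""
    out = subprocess.run(
        ["corosync-cfgtool", "-s"], capture_output=True, text=True, check=True
    ).stdout
    rings = faulty_rings(out)
    if rings:
        subprocess.run(["corosync-cfgtool", "-r"], check=True)
    return rings


# Status output copied from the first 'corosync-cfgtool -s' run above.
sample = """\
Printing ring status.
Local node ID 1358997696
RING ID 0
        id      = 192.168.0.209
        status  = ring 0 active with no faults
RING ID 1
        id      = 172.16.0.209
        status  = Marking seqid 24 ringid 1 interface 172.16.0.209 FAULTY - adminisrtative intervention required.
"""

print(faulty_rings(sample))  # [1]
```

Since `-r` re-enables all failed rings at once, the sketch only uses the list of faulty rings to decide whether to run it.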
      
  • The problem counter threshold can be changed with rrp_problem_count_threshold.
    • /etc/corosync/corosync.conf
      totem {
      
      (snip)
      
      	rrp_problem_count_threshold: 1000
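
To confirm which threshold a node is actually configured with, the setting can be read back from the configuration text. The sketch below is an assumption-laden illustration: `problem_count_threshold` is a hypothetical helper, the sample fragment is a made-up minimal totem block (not the snipped contents of the file above), and the fallback of 10 is the default noted earlier on this page.

```python
import re

DEFAULT_THRESHOLD = 10  # default value noted earlier on this page


def problem_count_threshold(conf_text):
    """Return rrp_problem_count_threshold from corosync.conf text, or the default."""
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        m = re.match(r"rrp_problem_count_threshold\s*:\s*(\d+)$", line)
        if m:
            return int(m.group(1))
    return DEFAULT_THRESHOLD


# Hypothetical minimal fragment, for illustration only.
sample_conf = """\
totem {
	rrp_problem_count_threshold: 1000
}
"""

print(problem_count_threshold(sample_conf))  # 1000
print(problem_count_threshold("totem {\n}\n"))  # 10
```

On a real node the text would be read from /etc/corosync/corosync.conf.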
      
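  • With two interfaces configured, if both go down, the problem counter stops increasing, is decremented back down, and the ring then (for some reason) returns to "no faults".
    • Links taken down in the order ring 1 -> ring 0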
      Jul  1 18:30:21 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 8312 iface 172.16.0.209 to [1 of 1000]
      Jul  1 18:30:22 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 8314 iface 172.16.0.209 to [2 of 1000]
      Jul  1 18:30:22 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 8316 iface 172.16.0.209 to [3 of 1000]
      Jul  1 18:30:23 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [2 of 1000]
      Jul  1 18:30:23 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 8318 iface 172.16.0.209 to [3 of 1000]
      Jul  1 18:30:24 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 8320 iface 172.16.0.209 to [4 of 1000]
      Jul  1 18:30:25 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [3 of 1000]
      Jul  1 18:30:25 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 8322 iface 172.16.0.209 to [4 of 1000]
      Jul  1 18:30:26 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 8324 iface 172.16.0.209 to [5 of 1000]
      Jul  1 18:30:27 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 8326 iface 172.16.0.209 to [6 of 1000]
      Jul  1 18:30:27 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [5 of 1000]
      Jul  1 18:30:27 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 8328 iface 172.16.0.209 to [6 of 1000]
      Jul  1 18:30:29 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [5 of 1000]
      Jul  1 18:30:31 debian-hab corosync[1424]:   [TOTEM ] A processor failed, forming new configuration.                                      <---- the second link also goes down here; the node becomes UNCLEAN (offline)
      Jul  1 18:30:31 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [4 of 1000]
      Jul  1 18:30:33 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [3 of 1000]
      
      (snip)
      
      Jul  1 18:30:35 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [2 of 1000]
      Jul  1 18:30:37 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [1 of 1000]
      Jul  1 18:30:39 debian-hab corosync[1424]:   [TOTEM ] ring 1 active with no faults
      
    • When either link comes back, the problem counter for the link that is still down starts counting again.
    • ring 1 comes back
      Jul  1 18:34:56 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 2 iface 192.168.0.209 to [1 of 1000]
      Jul  1 18:34:57 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 4 iface 192.168.0.209 to [2 of 1000]
      Jul  1 18:34:58 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 6 iface 192.168.0.209 to [3 of 1000]
      
      (snip)
      
      Jul  1 18:35:02 debian-hab pengine: [1435]: info: determine_online_status: Node debian-haa is online
      Jul  1 18:35:02 debian-hab pengine: [1435]: info: determine_online_status: Node debian-hab is online
      
      
      (snip)
      
      Jul  1 18:35:09 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 42 iface 192.168.0.209 to [15 of 1000]
      Jul  1 18:35:09 debian-hab corosync[1424]:   [MAIN  ] Completed service synchronization, ready to provide service.
      Jul  1 18:35:09 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 44 iface 192.168.0.209 to [16 of 1000]
      
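    • When ring 0 also comes back, the state returns to normal in the same way as when a single link goes down and recovers before the problem counter exceeds the threshold.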
      Jul  1 18:35:19 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 76 iface 192.168.0.209 to [27 of 1000]
      Jul  1 18:35:21 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 192.168.0.209 to [26 of 1000]
      
      (snip)
      
      Jul  1 18:36:09 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 192.168.0.209 to [2 of 1000]
      Jul  1 18:36:11 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 192.168.0.209 to [1 of 1000]
      Jul  1 18:36:13 debian-hab corosync[1424]:   [TOTEM ] ring 0 active with no faults
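
The interleaved Incrementing/Decrementing lines above are easier to follow as a per-interface series of counter values. The following is a rough sketch for extracting that series. The pattern is inferred solely from the log lines on this page, and `counter_history` is a hypothetical helper, not a corosync tool.

```python
import re
from collections import defaultdict

# "Incrementing" lines carry a seqid, "Decrementing" lines do not,
# so the seqid part is optional in the pattern.
COUNTER_RE = re.compile(
    r"\[TOTEM \] (?:Incrementing|Decrementing) problem counter for "
    r"(?:seqid \d+ )?iface (?P<iface>\S+) to \[(?P<n>\d+) of (?P<max>\d+)\]"
)


def counter_history(lines):
    """Map each interface address to the sequence of counter values logged for it."""
    history = defaultdict(list)
    for line in lines:
        m = COUNTER_RE.search(line)
        if m:
            history[m.group("iface")].append(int(m.group("n")))
    return dict(history)


# Five consecutive lines copied from the ring 1 excerpt above.
sample = [
    "Jul  1 18:30:26 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 8324 iface 172.16.0.209 to [5 of 1000]",
    "Jul  1 18:30:27 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 8326 iface 172.16.0.209 to [6 of 1000]",
    "Jul  1 18:30:27 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [5 of 1000]",
    "Jul  1 18:30:27 debian-hab corosync[1424]:   [TOTEM ] Incrementing problem counter for seqid 8328 iface 172.16.0.209 to [6 of 1000]",
    "Jul  1 18:30:29 debian-hab corosync[1424]:   [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [5 of 1000]",
]

hist = counter_history(sample)
for iface, values in hist.items():
    print(iface, "peak", max(values), "last", values[-1])
# 172.16.0.209 peak 6 last 5
```

Comparing the peak against the configured threshold shows how close an interface came to being marked FAULTY during an outage.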