[[PageOutline]]
[[TitleIndex(TipAndDoc/HA,format=group)]]

= High Availability =
 * [http://www.valinux.co.jp/contents/tech/techlib/eos/drbd_heartbeat/drbd_heartbeat_01.html An easy HA cluster with DRBD+Heartbeat (VA Linux Systems Japan)]
 * [http://d.hatena.ne.jp/n314/20100408/1270707324 Software RAID and DRBD work log - In Search of a Better Environment]
 * [http://d.hatena.ne.jp/n314/20100412/1271074044 heartbeat and DRBD work log + webmin - In Search of a Better Environment]
 * [http://sourceforge.jp/projects/linux-ha/lists/archive/japan/2008-February/000047.html I want to know which commands are available with HEARTBEAT (Linux-ha-jp) - Linux-HA Japan - SourceForge.JP]
 * [http://plaza.rakuten.co.jp/pirorin55/diary/200604270000/ RE: Diary for the 27th - Pirorin - Rakuten Blog] - building an HA cluster with Heartbeat and DRBD
 * [http://linux-ha.sourceforge.jp/wp/ Linux-HA Japan]
 * [http://cms-blog.mitsue.co.jp/archives/000238.html DRBD and Heartbeat (1): improving service availability | CMS Blog | Mitsue-Links]

== tutorial ==
 * [https://help.ubuntu.com/community/HighlyAvailableNFS HighlyAvailableNFS - Community Ubuntu Documentation]
 * [https://help.ubuntu.com/community/HighlyAvailableiSCSITarget HighlyAvailableiSCSITarget - Community Ubuntu Documentation]

= DRBD =
 * Distributed Replicated Block Device
 * http://www.drbd.jp/users-guide/users-guide.html
 * http://www.drbd.org/users-guide/
 * [https://help.ubuntu.com/10.04/serverguide/C/drbd.html DRBD]
 * [http://envotechie.wordpress.com/2010/12/07/installing-and-configuring-drbd-on-ubuntu-10-04/ Installing and Configuring DRBD on Ubuntu 10.04 « Johnson's Blog]
 * [http://www.dollpaper.com/info/460.html A cluster environment (Heartbeat+DRBD) on Ubuntu 10.04 | Saratoga IT Diary]
 * [http://lost-and-found-narihiro.blogspot.com/2010/10/ubuntu-1004-tls-drbd.html lost and found ( for me ? ): Ubuntu 10.04 TLS : DRBD]
   > adds a line to sources.list so that DRBD can be installed with apt
   * it is unclear whether this is actually necessary
     * judging from http://ppa.launchpad.net/ubuntu-ha/lucid-cluster/ubuntu/dists/lucid/main/binary-i386/Packages and the like, packages other than those marked "Source: redhat-cluster" do not seem to require the PPA (see the sketch at the end of this section)
 * [http://www.atmarkit.co.jp/flinux/special/drbd01/drbd01a.html @IT: DRBD + iSCSI, a dream team (part 1) (1/3)]
 * [http://d.hatena.ne.jp/y_fudi/20081224/1230111762 DRBD split-brain drill - Work Diary]
   * [http://www.drbd.org/users-guide/s-resolve-split-brain.html Manual split brain recovery]
 * [https://cybozu.atlassian.net/wiki/pages/viewpage.action?pageId=5701634 DRBD research notes - Cybozu engineers' WIKI (public Japanese articles)]
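The "line" in question would presumably be a PPA entry. A minimal sketch, assuming the ubuntu-ha lucid-cluster PPA whose Packages index is linked above; the actual line used in the article may differ:
{{{
# /etc/apt/sources.list -- assumed entry for the ubuntu-ha lucid-cluster PPA
deb http://ppa.launchpad.net/ubuntu-ha/lucid-cluster/ubuntu lucid main
}}}
After an apt-get update, the DRBD userland tools on Ubuntu 10.04 come from the drbd8-utils package, with or without the PPA.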
= Pacemaker =
 * [http://linux-ha.sourceforge.jp/wp/dl/pminstall_cent5 How to install Pacemaker: CentOS 5 edition « Linux-HA Japan]
 * [http://linux-ha.sourceforge.jp/wp/archives/441 Let's build a server with Pacemaker and DRBD (video demo) « Linux-HA Japan]
 * [http://www.srchack.org/article.php?story=20110211234742375 Corosync (Slackware 13.1) - @SRCHACK.ORG]
 * [http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ Clusters from Scratch]
   > Creating Active/Passive and Active/Active Clusters on Fedora
   * explains the configuration options in comparatively good detail
 * [http://www.clusterlabs.org/doc/crm_cli.html CRM CLI (command line interface) tool]
   * syntax and usage of the crm command
 * configuration examples
   * [http://www.clusterlabs.org/wiki/Example_configurations Example configurations - ClusterLabs]
     * basic explanations and examples
   * [http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 DRBD HowTo 1.0 - ClusterLabs]
 * setup examples
   * [http://d.hatena.ne.jp/civicivi/20110218/1298025235 Heartbeat v3 + Pacemaker, once again - notes of a certain server admin & PG & SE]
     * 2 nodes / DRBD / Apache / PostgreSQL
     * well organized
   * [http://www.atmarkit.co.jp/flinux/special/linuxha01/01c.html @IT: Try it out! The new Linux-HA (3/3)]
     * 2 nodes / DRBD / Apache
     * walks through the configuration from beginning to end
   * [http://docs.homelinux.org/doku.php?id=create_high-available_drbd_device_with_pacemaker Create High-Available DRBD Device with Pacemaker]
   * [http://docs.homelinux.org/doku.php?id=create_high-available_nfs_server_with_pacemaker Create High-Available NFS Server with Pacemaker]
   * [http://library.linode.com/linux-ha/highly-available-file-database-server-ubuntu-10.04 Build a Highly Available NFS/MySQL/PostgreSQL Server on Ubuntu 10.04 LTS (Lucid) – Linode Library]
 * [http://oss.clusterlabs.org/pipermail/pacemaker/2009-November/003466.html |Pacemaker| cluster doesn't failover - log at the end of msg]
   > You need to use a location constraint on your NFS resource and a ping/pingd resource to monitor network connectivity. Combined together you can make NFS or your DRBD Master resource constrained to nodes that have network connectivity. See http://www.clusterlabs.org/wiki/Example_configurations#Set_up_pingd.
   * this answers the following problem: in a setup where cluster liveness checks run over multiple links while the HA service is provided over a single link, a failure of the service link alone does not trigger a failover, because the liveness checks simply detour over the remaining links (see the crm sketch at the end of this section)
 * [http://linux-ha.sourceforge.jp/wp/archives/1657/2 Monthly Andrew-kun (June issue) « Linux-HA Japan]
   > The topic of what triggers a failover came up on the Japanese community mailing list; as it happens, Pacemaker introduces a parameter called "migration-threshold".
 * [http://linux-ha.sourceforge.jp/wp/archives/1809/3 Monthly Andrew-kun (July issue) « Linux-HA Japan]
   > This month we walk through the recovery procedure after the failure count reaches migration-threshold and a failover has occurred.
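To make the pingd advice and the migration-threshold parameter above concrete, here is a minimal crm shell sketch (entered under crm configure). The resource names res_nfs, p_ping, cl_ping and the gateway address 192.168.0.1 are hypothetical placeholders, not taken from the posts above:
{{{
# Monitor connectivity with the ocf:pacemaker:ping agent; it publishes a
# node attribute (named "pingd" by default) scored by reachable hosts.
primitive p_ping ocf:pacemaker:ping \
    params host_list="192.168.0.1" multiplier="100" \
    op monitor interval="10s"
clone cl_ping p_ping

# Forbid the (hypothetical) NFS resource on nodes with no connectivity,
# i.e. where the pingd attribute is undefined or zero.
location l_nfs_connected res_nfs \
    rule -inf: not_defined pingd or pingd lte 0

# Fail a resource over after three monitor/operation failures on a node.
rsc_defaults migration-threshold="3"
}}}
After a migration-threshold failover, the fail count has to be cleared (for example with "crm resource cleanup res_nfs") before the resource is allowed back on the failed node; the July issue linked above covers that recovery procedure.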
== corosync ==
 * [http://web.archiveorange.com/archive/v/yYk4BF4JNlUlPQSLhaMo corosync ring marked FAULTY - administrative intervention required - Open SA Forum AIS Services mailing list - ArchiveOrange]
 * [https://lists.linux-foundation.org/pipermail/openais/2011-March/015848.html |Openais| |PATCH| Implementation of automatic redundant ring recovery]
   * once this patch is merged, it may eventually resolve the [#interfaceFAULTY interface FAULTY] problem below

=== interface FAULTY ===
 * in a redundant configuration with multiple NICs, when an interface goes down, corosync starts incrementing a problem counter for it (the default threshold is 10); once the counter reaches the threshold, the interface is marked FAULTY and is no longer used
{{{
Jul 1 18:10:00 debian-hab corosync[1377]: [TOTEM ] Incrementing problem counter for seqid 850 iface 172.16.0.209 to [1 of 10]
Jul 1 18:10:00 debian-hab corosync[1377]: [TOTEM ] Incrementing problem counter for seqid 852 iface 172.16.0.209 to [2 of 10]
(snip)
Jul 1 18:10:11 debian-hab corosync[1377]: [TOTEM ] Incrementing problem counter for seqid 876 iface 172.16.0.209 to [9 of 10]
Jul 1 18:10:11 debian-hab corosync[1377]: [TOTEM ] Incrementing problem counter for seqid 878 iface 172.16.0.209 to [10 of 10]
Jul 1 18:10:11 debian-hab corosync[1377]: [TOTEM ] Marking seqid 878 ringid 1 interface 172.16.0.209 FAULTY - adminisrtative intervention required.
}}}
 * even if the link comes back up after the interface has been marked FAULTY, it has to be re-enabled manually with corosync-cfgtool -r (rebooting a single node does not clear the flag; restarting all nodes with staggered timing sometimes does, but the required timing is unclear; a scripted workaround is sketched at the end of this page)
 * mitty@debian-hab:~$ sudo corosync-cfgtool -s
{{{
Printing ring status.
Local node ID 1358997696
RING ID 0
        id      = 192.168.0.209
        status  = ring 0 active with no faults
RING ID 1
        id      = 172.16.0.209
        status  = Marking seqid 24 ringid 1 interface 172.16.0.209 FAULTY - adminisrtative intervention required.
}}}
 * mitty@debian-hab:~$ sudo corosync-cfgtool -r
{{{
Re-enabling all failed rings.
}}}
 * mitty@debian-hab:~$ sudo corosync-cfgtool -s
{{{
Printing ring status.
Local node ID 1358997696
RING ID 0
        id      = 192.168.0.209
        status  = ring 0 active with no faults
RING ID 1
        id      = 172.16.0.209
        status  = ring 1 active with no faults
}}}
 * the problem counter threshold can be changed with rrp_problem_count_threshold
   * /etc/corosync/corosync.conf
{{{
totem {
        (snip)
        rrp_problem_count_threshold: 1000
}}}
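For reference, a redundant-ring totem section of the kind that produces the logs on this page would look roughly like the following. This is a sketch assuming the 192.168.0.x and 172.16.0.x networks seen in the logs; rrp_mode and the mcast values are illustrative placeholders:
{{{
# /etc/corosync/corosync.conf -- sketch of a two-ring (redundant ring) setup
totem {
        version: 2
        # passive: use one ring at a time; active: send on all rings
        rrp_mode: passive
        rrp_problem_count_threshold: 1000
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.0.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 172.16.0.0
                mcastaddr: 226.94.1.2
                mcastport: 5405
        }
}
}}}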
 * with two interfaces configured, if both of them go down, the problem counter stops incrementing, counts back down, and then (for some reason) the ring even returns to "no faults"
   * here ring 1 goes down first, then ring 0
{{{
Jul 1 18:30:21 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 8312 iface 172.16.0.209 to [1 of 1000]
Jul 1 18:30:22 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 8314 iface 172.16.0.209 to [2 of 1000]
Jul 1 18:30:22 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 8316 iface 172.16.0.209 to [3 of 1000]
Jul 1 18:30:23 debian-hab corosync[1424]: [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [2 of 1000]
Jul 1 18:30:23 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 8318 iface 172.16.0.209 to [3 of 1000]
Jul 1 18:30:24 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 8320 iface 172.16.0.209 to [4 of 1000]
Jul 1 18:30:25 debian-hab corosync[1424]: [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [3 of 1000]
Jul 1 18:30:25 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 8322 iface 172.16.0.209 to [4 of 1000]
Jul 1 18:30:26 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 8324 iface 172.16.0.209 to [5 of 1000]
Jul 1 18:30:27 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 8326 iface 172.16.0.209 to [6 of 1000]
Jul 1 18:30:27 debian-hab corosync[1424]: [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [5 of 1000]
Jul 1 18:30:27 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 8328 iface 172.16.0.209 to [6 of 1000]
Jul 1 18:30:29 debian-hab corosync[1424]: [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [5 of 1000]
Jul 1 18:30:31 debian-hab corosync[1424]: [TOTEM ] A processor failed, forming new configuration.
                              <---- at this point the second link goes down as well; the node goes to UNCLEAN (offline)
Jul 1 18:30:31 debian-hab corosync[1424]: [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [4 of 1000]
Jul 1 18:30:33 debian-hab corosync[1424]: [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [3 of 1000]
(snip)
Jul 1 18:30:35 debian-hab corosync[1424]: [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [2 of 1000]
Jul 1 18:30:37 debian-hab corosync[1424]: [TOTEM ] Decrementing problem counter for iface 172.16.0.209 to [1 of 1000]
Jul 1 18:30:39 debian-hab corosync[1424]: [TOTEM ] ring 1 active with no faults
}}}
 * when one of the links comes back up, the problem counter for the link that is still down starts incrementing again
   * ring 1 comes back up
{{{
Jul 1 18:34:56 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 2 iface 192.168.0.209 to [1 of 1000]
Jul 1 18:34:57 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 4 iface 192.168.0.209 to [2 of 1000]
Jul 1 18:34:58 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 6 iface 192.168.0.209 to [3 of 1000]
(snip)
Jul 1 18:35:02 debian-hab pengine: [1435]: info: determine_online_status: Node debian-haa is online
Jul 1 18:35:02 debian-hab pengine: [1435]: info: determine_online_status: Node debian-hab is online
(snip)
Jul 1 18:35:09 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 42 iface 192.168.0.209 to [15 of 1000]
Jul 1 18:35:09 debian-hab corosync[1424]: [MAIN  ] Completed service synchronization, ready to provide service.
Jul 1 18:35:09 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 44 iface 192.168.0.209 to [16 of 1000]
}}}
 * once ring 0 comes back up as well, everything returns to normal, in the same way as when a single link goes down and recovers before the problem counter crosses the threshold
{{{
Jul 1 18:35:19 debian-hab corosync[1424]: [TOTEM ] Incrementing problem counter for seqid 76 iface 192.168.0.209 to [27 of 1000]
Jul 1 18:35:21 debian-hab corosync[1424]: [TOTEM ] Decrementing problem counter for iface 192.168.0.209 to [26 of 1000]
(snip)
Jul 1 18:36:09 debian-hab corosync[1424]: [TOTEM ] Decrementing problem counter for iface 192.168.0.209 to [2 of 1000]
Jul 1 18:36:11 debian-hab corosync[1424]: [TOTEM ] Decrementing problem counter for iface 192.168.0.209 to [1 of 1000]
Jul 1 18:36:13 debian-hab corosync[1424]: [TOTEM ] ring 0 active with no faults
}}}
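Since re-enabling FAULTY rings is manual, the corosync-cfgtool -r step can be scripted (e.g. from cron) once the links are known to be healthy again. A minimal sketch, assuming the -s output format shown above; note that re-enabling a ring whose link is still down just restarts the problem counting toward FAULTY:
{{{
#!/bin/sh
# Sketch: re-enable corosync rings that were marked FAULTY on this node.
# "corosync-cfgtool -s" prints FAULTY in the status line of a marked ring.
if corosync-cfgtool -s | grep -q 'FAULTY'; then
    # -r re-enables all failed rings known to the local corosync
    corosync-cfgtool -r
fi
}}}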