Automatic failover for PostgreSQL streaming replication with heartbeat + pacemaker (Part 2)
5. Testing
5.1 Standby node failure
Kill the postgres processes on node2 to simulate a database crash on the standby node:
[root@node2 ~]# killall -9 postgres
Check the cluster status:
[root@node1 ~]# crm_mon -Afr1
============
Last updated: Mon Jan 27 08:36:49 2014
Stack: Heartbeat
Current DC: node1 (30b7dc95-25c5-40d7-b1e4-7eaf2d5cdf07) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============
Online: [ node1 node2 ]
Full list of resources:
vip-slave (ocf::heartbeat:IPaddr2): Started node1
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started node1
vip-rep (ocf::heartbeat:IPaddr2): Started node1
Master/Slave Set: msPostgresql
Masters: [ node1 ]
Stopped: [ pgsql:1 ]
Clone Set: clnPingCheck
Started: [ node1 node2 ]
Node Attributes:
* Node node1:
+ default_ping_set : 100
+ master-pgsql:0 : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 0000000010000000
+ pgsql-status : PRI
* Node node2:
+ default_ping_set : 100
+ master-pgsql:1 : -INFINITY
+ pgsql-data-status : DISCONNECT
+ pgsql-status : STOP
Migration summary:
* Node node1:
* Node node2:
pgsql:1: migration-threshold=1 fail-count=1
Failed actions:
pgsql:1_monitor_7000 (node=node2, call=11, rc=7, status=complete): not running
{The vip-slave resource has been switched to node1 successfully}
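When repeating this test it can help to pull a single node attribute out of the crm_mon output instead of reading the whole listing. The following is a minimal sketch, assuming the "Node Attributes" layout shown above; the function name node_attr is hypothetical and not part of pacemaker.

```shell
#!/bin/sh
# node_attr NODE ATTR: read `crm_mon -Afr1` output on stdin and print the
# value of attribute ATTR for node NODE (prints nothing if it is not found).
# Hypothetical helper; it only parses text and never calls the cluster.
node_attr() {
    awk -v node="$1" -v attr="$2" '
        # Only look at lines between "Node Attributes:" and "Migration summary:".
        /^Node Attributes:/   { in_attrs = 1; next }
        /^Migration summary:/ { in_attrs = 0 }
        !in_attrs { next }
        # "* Node node1:" starts the attribute list of one node.
        $1 == "*" && $2 == "Node" { cur = $3; sub(/:$/, "", cur); next }
        # "+ pgsql-status : PRI" is one attribute line of the current node.
        $1 == "+" && cur == node && $2 == attr { print $4; exit }
    '
}

# Example (on a cluster node):
#   crm_mon -Afr1 | node_attr node2 pgsql-status
```

With the output shown above, the attribute pgsql-status of node2 would be reported as STOP.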
Restart heartbeat on node2; the database is restarted along with it:
[root@node2 ~]# service heartbeat restart
After a short while, check the status again:
[root@node1 ~]# crm_mon -Afr1
============
Last updated: Mon Jan 27 08:39:16 2014
Stack: Heartbeat
Current DC: node1 (30b7dc95-25c5-40d7-b1e4-7eaf2d5cdf07) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============
Online: [ node1 node2 ]
Full list of resources:
vip-slave (ocf::heartbeat:IPaddr2): Started node2
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started node1
vip-rep (ocf::heartbeat:IPaddr2): Started node1
Master/Slave Set: msPostgresql
Masters: [ node1 ]
Slaves: [ node2 ]
Clone Set: clnPingCheck
Started: [ node1 node2 ]
Node Attributes:
* Node node1:
+ default_ping_set : 100
+ master-pgsql:0 : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 0000000010000000
+ pgsql-status : PRI
* Node node2:
+ default_ping_set : 100
+ master-pgsql:1 : 100
+ pgsql-data-status : STREAMING|SYNC
+ pgsql-status : HS:sync
Migration summary:
* Node node1:
* Node node2:
{vip-slave has moved back to node2, and streaming replication has been re-established}
5.2 Primary node failover
Kill the postgres processes on node1 to simulate a database crash on the primary node:
[root@node1 ~]# killall -9 postgres
Check the cluster status:
[root@node2 ~]# crm_mon -Afr -1
============
Last updated: Mon Jan 27 08:43:03 2014
Stack: Heartbeat
Current DC: node1 (30b7dc95-25c5-40d7-b1e4-7eaf2d5cdf07) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============
Online: [ node1 node2 ]
Full list of resources:
vip-slave (ocf::heartbeat:IPaddr2): Started node2
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started node2
vip-rep (ocf::heartbeat:IPaddr2): Started node2
Master/Slave Set: msPostgresql
Masters: [ node2 ]
Stopped: [ pgsql:0 ]
Clone Set: clnPingCheck
Started: [ node1 node2 ]
Node Attributes:
* Node node1:
+ default_ping_set : 100
+ master-pgsql:0 : -INFINITY
+ pgsql-data-status : DISCONNECT
+ pgsql-status : STOP
* Node node2:
+ default_ping_set : 100
+ master-pgsql:1 : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 00000000120000B0
+ pgsql-status : PRI
Migration summary:
* Node node1:
pgsql:0: migration-threshold=1 fail-count=1
* Node node2:
Failed actions:
pgsql:0_monitor_2000 (node=node1, call=25, rc=7, status=complete): not running
{vip-master and vip-rep have been switched to node2 successfully, node2 has become the master, and the pg database status on node2 has changed to PRI}
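To confirm a failover from a script, the node on the "Masters:" line can be extracted from the same output. This is a small sketch under the assumption that the line has the form shown above; current_master is a hypothetical name.

```shell
#!/bin/sh
# current_master: read `crm_mon -Afr1` output on stdin and print the node
# name(s) listed on the "Masters: [ ... ]" line of the master/slave set.
# Hypothetical helper; it only parses text.
current_master() {
    sed -n 's/^[[:space:]]*Masters: \[ \([^]]*\) \].*$/\1/p'
}

# Example (on a cluster node):
#   crm_mon -Afr1 | current_master
```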
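5.3 Recovering the former primary node
Repair the former primary node and bring it back as the current standby node.
Run a base backup on node1 to resynchronize it from the new primary:
[postgres@node1 data]$ pwd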
/opt/pgsql/data
[postgres@node1 data]$ rm -rf *
[postgres@node1 data]$ pg_basebackup -h 192.168.2.3 -U postgres -D /opt/pgsql/data/ -P
19172/19172 kB (100%), 1/1 tablespace
NOTICE: pg_stop_backup complete, all required WAL segments have been archived
[postgres@node1 data]$ ls
backup_label base pg_clog pg_ident.conf pg_notify pg_stat_tmp pg_tblspc PG_VERSION postgresql.conf
backup_label.old global pg_hba.conf pg_multixact pg_serial pg_subtrans pg_twophase pg_xlog recovery.done
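Before starting heartbeat, the lock file must be deleted; otherwise the resource will not start along with heartbeat:
[root@node1 ~]# rm -rf /var/lib/pgsql/tmp/PGSQL.lock
{This lock file is created while a node is acting as primary, and it is not removed automatically when heartbeat stops abnormally or when the database or the system terminates abnormally. So whenever a node that has served as primary is restored, the lock file has to be cleaned up manually}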
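The recovery steps above can be collected into one script. What follows is only a sketch under stated assumptions: the paths and the primary address are the ones from the transcript, the script is assumed to run as root, and the names run, rebuild_standby and DRY_RUN are hypothetical. It defaults to a dry run that just prints the commands, because the first step wipes the data directory.

```shell
#!/bin/sh
# Values copied from the transcript above; override them for another setup.
PGDATA_DIR="${PGDATA_DIR:-/opt/pgsql/data}"
PRIMARY_HOST="${PRIMARY_HOST:-192.168.2.3}"
LOCK_FILE="${LOCK_FILE:-/var/lib/pgsql/tmp/PGSQL.lock}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = only print the commands

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# rebuild_standby: empty the data directory, take a fresh base backup from
# the current primary, remove the stale lock file, then restart heartbeat.
rebuild_standby() {
    # Refuse to continue with an empty path so the delete cannot hit "/".
    [ -n "$PGDATA_DIR" ] || { echo "PGDATA_DIR is empty" >&2; return 1; }
    # Unlike `rm -rf *`, this also removes hidden files (GNU find assumed).
    run find "$PGDATA_DIR" -mindepth 1 -delete
    run su - postgres -c "pg_basebackup -h $PRIMARY_HOST -U postgres -D $PGDATA_DIR -P"
    run rm -f "$LOCK_FILE"
    run service heartbeat restart
}

# Example: review the plan first, then run it for real.
#   rebuild_standby
#   DRY_RUN=0; rebuild_standby
```

Keeping the dry run as the default makes it harder to erase a data directory on the wrong node by accident.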
Restart heartbeat on node1:
[root@node1 ~]# service heartbeat restart
After a short while, check the cluster status:
[root@node2 ~]# crm_mon -Afr1
============
Last updated: Mon Jan 27 08:50:43 2014
Stack: Heartbeat
Current DC: node2 (f2dcd1df-7429-42f5-82e9-b73921f97cab) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============
Online: [ node1 node2 ]
Full list of resources:
vip-slave (ocf::heartbeat:IPaddr2): Started node1
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started node2
vip-rep (ocf::heartbeat:IPaddr2): Started node2
Master/Slave Set: msPostgresql
Masters: [ node2 ]
Slaves: [ node1 ]
Clone Set: clnPingCheck
Started: [ node1 node2 ]
Node Attributes:
* Node node1:
+ default_ping_set : 100
+ master-pgsql:0 : 100
+ pgsql-data-status : STREAMING|SYNC
+ pgsql-status : HS:sync
* Node node2:
+ default_ping_set : 100
+ master-pgsql:1 : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 00000000120000B0
+ pgsql-status : PRI
Migration summary:
* Node node1:
* Node node2:
{vip-slave has been switched to node1 successfully, and node1 has become the streaming replication standby node}
6. Administration
6.1 Starting and stopping heartbeat
[root@node1 ~]# service heartbeat start
[root@node1 ~]# service heartbeat stop
6.2 Viewing HA status
[root@node1 ~]# crm status
6.3 Viewing resource status and node attributes
[root@node1 ~]# crm_mon -Afr -1
6.4 Viewing the configuration
[root@node1 ~]# crm configure show
6.5 Monitoring HA in real time
[root@node1 ~]# crm_mon -Afr
6.6 The crm_resource command
Start/stop a resource:
[root@node1 ~]# crm_resource -r vip-master -v started
[root@node1 ~]# crm_resource -r vip-master -v stopped
List resources:
[root@node1 ~]# crm_resource -L
vip-slave (ocf::heartbeat:IPaddr2): Started
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started
vip-rep (ocf::heartbeat:IPaddr2): Started
Master/Slave Set: msPostgresql [pgsql]
Masters: [ node1 ]
Slaves: [ node2 ]
Clone Set: clnPingCheck [pingCheck]
Started: [ node1 node2 ]
Show where a resource is running:
[root@node1 ~]# crm_resource -W -r pgsql
resource pgsql is running on: node2
Migrate a resource:
[root@node1 ~]# crm_resource -M -r vip-slave -N node2
Delete a resource:
[root@node1 ~]# crm_resource -D -r vip-slave -t primitive
6.7 The crm command
List the resource agents (RAs) of a given class and provider:
[root@node1 ~]# crm ra list ocf pacemaker
ClusterMon Dummy HealthCPU HealthSMART Stateful SysInfo SystemHealth controld ping pingd
remote
Delete a node:
[root@node1 ~]# crm node delete node2
Put a node into standby:
[root@node1 ~]# crm node standby node2
Bring a node back online:
[root@node1 ~]# crm node online node2
Configure pacemaker:
[root@node1 ~]# crm configure
crm(live)configure#
……
……
crm(live)configure# commit
crm(live)configure# quit
6.8 Resetting the failcount
[root@node1 ~]# crm resource
crm(live)resource# failcount pgsql set node1 0
crm(live)resource# failcount pgsql show node1
scope=status name=fail-count-pgsql value=0
[root@node1 ~]# crm resource cleanup pgsql
Cleaning up pgsql:0 on node1
Waiting for 1 replies from the CRMd. OK
[root@node1 ~]# crm_failcount -G -U node1 -r pgsql
scope=status name=fail-count-pgsql value=INFINITY
[root@node1 ~]# crm_failcount -D -U node1 -r pgsql
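For scripted checks, the value can be read out of the crm_failcount output line. A minimal sketch, assuming the "scope=... name=... value=..." format shown above; failcount_value and needs_cleanup are hypothetical helper names.

```shell
#!/bin/sh
# failcount_value: read a line of `crm_failcount -G` output on stdin, e.g.
# "scope=status name=fail-count-pgsql value=INFINITY", and print the value.
failcount_value() {
    sed -n 's/.*value=\([^[:space:]]*\).*/\1/p'
}

# needs_cleanup VALUE: succeed when the fail count is set and not zero.
needs_cleanup() {
    [ -n "$1" ] && [ "$1" != "0" ]
}

# Example (on a cluster node), using only commands from this section:
#   count=$(crm_failcount -G -U node1 -r pgsql | failcount_value)
#   if needs_cleanup "$count"; then crm resource cleanup pgsql; fi
```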
7. Troubleshooting notes
7.1 Q1
Problem:
The heartbeat log shows the following warnings:
Jan 24 07:47:36 node1 heartbeat: [2515]: WARN: nodename node1 uuid changed to node2
Jan 24 07:47:38 node1 heartbeat: [2515]: WARN: nodename node2 uuid changed to node1
Jan 24 07:47:38 node1 heartbeat: [2515]: WARN: nodename node1 uuid changed to node2
Jan 24 07:47:40 node1 heartbeat: [2515]: WARN: nodename node2 uuid changed to node1
Jan 24 07:47:40 node1 heartbeat: [2515]: WARN: nodename node1 uuid changed to node2
Jan 24 07:47:42 node1 heartbeat: [2515]: WARN: nodename node2 uuid changed to node1
Jan 24 07:47:42 node1 heartbeat: [2515]: WARN: nodename node1 uuid changed to node2
Solution:
node2 was created by cloning a virtual machine, so both nodes have the same hb_uuid. Delete the file and let heartbeat regenerate it:
[root@node2 ~]# rm -rf /var/lib/heartbeat/hb_uuid
[root@node2 ~]# service heartbeat restart
After the restart, a new hb_uuid is generated.
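Whether two nodes really share the same hb_uuid can be checked before deleting anything. The sketch below assumes a copy of the other node's /var/lib/heartbeat/hb_uuid has been fetched to a local path (for example with scp); same_hb_uuid is a hypothetical helper name.

```shell
#!/bin/sh
# same_hb_uuid FILE_A FILE_B: succeed when both files are byte-identical,
# which is the situation that leads to the "uuid changed" warnings.
same_hb_uuid() {
    cmp -s "$1" "$2"
}

# Example, with /tmp/hb_uuid.node2 being a copy fetched from node2:
#   if same_hb_uuid /var/lib/heartbeat/hb_uuid /tmp/hb_uuid.node2; then
#       echo "duplicate hb_uuid, regenerate it on one node"
#   fi
```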
7.2 Q2
Problem:
Loading the configuration fails:
[root@node1 ~]# crm configure load update pgsql.crm
ERROR: pgsql: parameter rep_mode does not exist
ERROR: pgsql: parameter node_list does not exist
ERROR: pgsql: parameter master_ip does not exist
ERROR: pgsql: parameter restore_command does not exist
ERROR: pgsql: parameter primary_conninfo_opt does not exist
WARNING: pgsql: specified timeout 60s for stop is smaller than the advised 120
WARNING: pgsql: action monitor_Master not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: specified timeout 60s for start is smaller than the advised 120
WARNING: pgsql: action notify not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: action demote not advertised in meta-data, it may not be supported by the RA
WARNING: pgsql: action promote not advertised in meta-data, it may not be supported by the RA
WARNING: pingCheck: specified timeout 60s for start is smaller than the advised 90
WARNING: pingCheck: specified timeout 60s for stop is smaller than the advised 100
Do you still want to commit?
Solution:
The installed pgsql script is too old and does not support some of the parameters set in pgsql.crm. Download a newer pgsql script and replace the old one:
https://raw.github.com/ClusterLabs/resource-agents
7.3 Q3
Problem:
Loading the configuration fails:
[root@node1 ~]# crm configure load update pgsql.crm
lrmadmin[15368]: 2014/01/24_09:18:44 ERROR: lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply message of rmetadata with function get_ret_from_msg.
ERROR: ocf:heartbeat:pgsql: could not parse meta-data:
ERROR: ocf:heartbeat:pgsql: could not parse meta-data:
ERROR: ocf:heartbeat:pgsql: no such resource agent
WARNING: pingCheck: specified timeout 60s for start is smaller than the advised 90
WARNING: pingCheck: specified timeout 60s for stop is smaller than the advised 100
Do you still want to commit?
Solution:
The permissions of the pgsql script are wrong. Fix them with the following command:
# chmod 755 /usr/lib/ocf/resource.d/heartbeat/pgsql
7.4 Q4
Problem:
Starting heartbeat reports an error:
[root@node1 ~]# service heartbeat start
/usr/lib/ocf/lib//heartbeat/ocf-shellfuncs: line 56: @OCF_ROOT_DIR@/lib/heartbeat/ocf-binaries: No such file or directory
Solution:
On CentOS 5.5 the @OCF_ROOT_DIR@ placeholder is not replaced with the actual path. As a workaround, edit ocf-shellfuncs as follows:
if [ -z "$OCF_ROOT" ]; then
# : ${OCF_ROOT=@OCF_ROOT_DIR@}
: ${OCF_ROOT=/usr/lib/ocf}
fi
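Other scripts taken from the same source may contain further placeholders of this kind. A small sketch for finding them, assuming they follow the autoconf-style @NAME@ pattern seen in the error message; unsubstituted_placeholders is a hypothetical helper name.

```shell
#!/bin/sh
# unsubstituted_placeholders FILE: print every line of FILE that still
# contains an @NAME@ placeholder, prefixed with its line number.
# Commented-out lines are reported too, so review the hits by hand.
unsubstituted_placeholders() {
    grep -n '@[A-Z_][A-Z0-9_]*@' "$1"
}

# Example:
#   unsubstituted_placeholders /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs
```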
7.5 Q5
Problem:
Starting heartbeat reports an error:
# service heartbeat start
/usr/lib/ocf/lib//heartbeat/ocf-shellfuncs: line 60: /usr/lib/ocf/lib/heartbeat/ocf-rarun: No such file or directory
Solution:
The ocf-rarun script is missing and has to be downloaded.
Download location: https://raw.github.com/ClusterLabs/resource-agents
7.6 Q6
Problem:
heartbeat fails to start because it cannot find the programs it needs to launch:
[root@db1 ~]# service heartbeat start
Starting High-Availability services: Heartbeat failure [rc=6]. Failed.
heartbeat[2074]: 2014/01/23_09:06:59 info: Pacemaker support: yes
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Client child command [/usr/lib64/heartbeat/cib] is not executable
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Directive failfast hacluster /usr/lib64/heartbeat/cib failed
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Client child command [/usr/lib64/heartbeat/stonithd] is not executable
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Directive respawn root /usr/lib64/heartbeat/stonithd failed
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Client child command [/usr/lib64/heartbeat/attrd] is not executable
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Directive respawn hacluster /usr/lib64/heartbeat/attrd failed
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Client child command [/usr/lib64/heartbeat/crmd] is not executable
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Directive failfast hacluster /usr/lib64/heartbeat/crmd failed
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Heartbeat not started: configuration error.
heartbeat[2074]: 2014/01/23_09:06:59 ERROR: Configuration error, heartbeat not started.
Solution: link the Pacemaker daemons into the directory where heartbeat looks for them:
ln -s /usr/libexec/pacemaker/cib /usr/lib64/heartbeat/cib
ln -s /usr/libexec/pacemaker/stonithd /usr/lib64/heartbeat/stonithd
ln -s /usr/libexec/pacemaker/attrd /usr/lib64/heartbeat/attrd
ln -s /usr/libexec/pacemaker/crmd /usr/lib64/heartbeat/crmd
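The four links can also be created in a loop that checks each target first. This is a sketch only: the two directories are the ones from the commands above, and link_pacemaker_daemons and the two variables are hypothetical names.

```shell
#!/bin/sh
# Locations taken from the log and commands above; override as needed.
PACEMAKER_LIBEXEC="${PACEMAKER_LIBEXEC:-/usr/libexec/pacemaker}"
HEARTBEAT_LIBDIR="${HEARTBEAT_LIBDIR:-/usr/lib64/heartbeat}"

# link_pacemaker_daemons: create one symlink per daemon that heartbeat
# reported as "not executable", skipping links that already exist.
link_pacemaker_daemons() {
    for daemon in cib stonithd attrd crmd; do
        target="$PACEMAKER_LIBEXEC/$daemon"
        link="$HEARTBEAT_LIBDIR/$daemon"
        # Stop early if the real binary is missing or not executable.
        if [ ! -x "$target" ]; then
            echo "missing or not executable: $target" >&2
            return 1
        fi
        if [ ! -e "$link" ]; then
            ln -s "$target" "$link" || return 1
        fi
    done
}

# Example (as root):
#   link_pacemaker_daemons
```

Checking the target first avoids leaving dangling links when pacemaker is installed under a different path.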
7.7 Q7
Problem:
Starting heartbeat reports an error:
Jan 23 09:10:15 db1 heartbeat: [2129]: info: Heartbeat generation: 1390439416
Jan 23 09:10:15 db1 heartbeat: [2129]: info: No uuid found for current node - generating a new uuid.
Jan 23 09:10:15 db1 heartbeat: [2129]: info: Creating FIFO/var/lib/heartbeat/fifo.
Jan 23 09:10:15 db1 heartbeat: [2129]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
Jan 23 09:10:15 db1 heartbeat: [2129]: info: glib: ucast: bound send socket to device: eth1
Jan 23 09:10:15 db1 heartbeat: [2129]: ERROR: glib: ucast: error setting option SO_REUSEPORT(w): Protocol not available
Jan 23 09:10:15 db1 heartbeat: [2129]: ERROR: make_io_childpair: cannot open ucast eth1
Jan 23 09:10:16 db1 heartbeat: [2132]: CRIT: Emergency Shutdown: Master Control process died.
Jan 23 09:10:16 db1 heartbeat: [2132]: CRIT: Killing pid 2129 with SIGTERM
Jan 23 09:10:16 db1 heartbeat: [2132]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
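Solution:
1. Upgrade the kernel. The current kernel does not support ucast here: as the log shows, setting the SO_REUSEPORT socket option fails.
2. Or switch to another heartbeat communication method such as mcast or bcast.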
7.8 Q8
Problem:
Using the bcast heartbeat method produces the following errors:
Jan 24 01:30:20 db2 heartbeat: [29856]: ERROR: glib: Error binding socket (Address already in use). Retrying.
Jan 24 01:30:21 db2 heartbeat: [29856]: ERROR: glib: Error binding socket (Address already in use). Retrying.
Jan 24 01:30:22 db2 heartbeat: [29856]: ERROR: glib: Error binding socket (Address already in use). Retrying.
Jan 24 01:30:23 db2 heartbeat: [29856]: ERROR: glib: Error binding socket (Address already in use). Retrying.
Jan 24 01:30:24 db2 heartbeat: [29856]: ERROR: glib: Unable to bind socket (Address already in use). Giving up.
Jan 24 01:30:24 db2 heartbeat: [29856]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth1 - Status: 1
Jan 24 01:30:24 db2 heartbeat: [29856]: ERROR: make_io_childpair: cannot open bcast eth1
Jan 24 01:30:25 db2 heartbeat: [29859]: CRIT: Emergency Shutdown: Master Control process died.
Jan 24 01:30:25 db2 heartbeat: [29859]: CRIT: Killing pid 29856 with SIGTERM
Jan 24 01:30:25 db2 heartbeat: [29859]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
Solution:
The errors mean that port 694 is already in use, as the following shows:
[root@db1 ~]# netstat -nlp | grep 694
udp 0 0 0.0.0.0:694 0.0.0.0:* 1367/rpcbind
udp 0 0 :::694 :::* 1367/rpcbind
Set a different UDP port in ha.cf, for example: udpport 692
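A plain grep for 694 also matches process IDs or ports such as 6940. The following sketch compares only the port part of the local address, assuming the netstat -nlp column layout shown above; udp_port_owner is a hypothetical helper name.

```shell
#!/bin/sh
# udp_port_owner PORT: read `netstat -nlp` output on stdin and print the
# "PID/program" column of every UDP socket bound to exactly PORT.
udp_port_owner() {
    awk -v port="$1" '
        $1 ~ /^udp/ {
            # Column 4 is the local address, e.g. 0.0.0.0:694 or :::694;
            # the port is whatever follows the last colon.
            n = split($4, parts, ":")
            if (parts[n] == port) print $NF
        }
    ' | sort -u
}

# Example: check the default port and the replacement before editing ha.cf.
#   netstat -nlp | udp_port_owner 694
#   netstat -nlp | udp_port_owner 692
```

An empty result for the new port suggests it is free at the moment the command runs.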
8. References
Script:
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
How to use the script:
https://github.com/t-matsuo/resource-agents/wiki/Resource-Agent-for-PostgreSQL-9.1-streaming-replication
crm_resource command:
http://www.novell.com/zh-cn/documentation/sle_ha/book_sleha/data/man_crmresource.html
crm_failcount command:
http://www.novell.com/zh-cn/documentation/sle_ha/book_sleha/data/man_crmfailcount.html