Prerequisites:
- Servers listed in the Red Hat HCL https://hardware.redhat.com/
- Servers with a certified fence device
https://access.redhat.com/kb/docs/DOC-30003
https://access.redhat.com/kb/docs/DOC-30004
- Storage
Documentation you must read:
- http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/index.html
- http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Suite_Overview/index.html
- https://fedorahosted.org/cluster/wiki/HomePage
- https://access.redhat.com/kb/docs/DOC-59827
- http://www.centos.org/docs/5/pdf/Cluster_Administration.pdf
- http://www.syslog.gr/articles-mainmenu-99/26-redhat-cluster-suite.html
Network Definition:
- one "network" configured for LAN (client to server - application)
- one "network" configured for Intracluster/Heartbeat/Interlink (come più vi piace chiamarla)
- one "network" configured for Fence Device
- https://access.redhat.com/kb/docs/DOC-53348
Definition of Cluster and Cluster Node:
- File /etc/hosts https://access.redhat.com/kb/docs/DOC-25593
- Naming Server/Cluster Node https://access.redhat.com/kb/docs/DOC-5935
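As a sketch, an /etc/hosts laid out along those lines might look like this (all addresses and the LAN hostnames are placeholders for your environment; the -hb and -ilo names are the ones used later in this guide):
10.0.0.11       node1
10.0.0.12       node2
192.168.100.11  node1-hb
192.168.100.12  node2-hb
192.168.200.11  node1-ilo
192.168.200.12  node2-ilo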
Fencing:
- FENCE ILO3 https://access.redhat.com/kb/docs/DOC-56880
- FENCE ACPID https://access.redhat.com/kb/docs/DOC-5415
- FENCE with 2 nodes https://access.redhat.com/kb/docs/DOC-55111
- FENCE Parameters http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/ap-fence-device-param-CA.html
Quorum:
- QuorumDisk https://access.redhat.com/knowledge/techbriefs/how-optimally-configure-quorum-disk-red-hat-enterprise-linux-clustering-and-hig
- https://access.redhat.com/sites/default/files/rhel_cluster_qdisk.pdf
- Qdisk in a 2-node environment https://access.redhat.com/kb/docs/DOC-52938
- LastManStanding https://access.redhat.com/kb/docs/DOC-60258
- Le "5 Domande" sul Quorum https://access.redhat.com/kb/docs/DOC-52974
3-Node Cluster:
- RHCS was not designed for complete fault tolerance; for example, you need to change the default configuration to get a 3-node cluster that keeps services running when 2 nodes have failed - https://access.redhat.com/kb/docs/DOC-23914
- https://access.redhat.com/kb/docs/DOC-49683
Service and Daemon:
- https://access.redhat.com/kb/docs/DOC-5899
Best Practice:
- https://access.redhat.com/kb/docs/DOC-40821
Test/Acceptance:
- Test FailOver https://access.redhat.com/kb/docs/DOC-5902
- Test Fencing https://access.redhat.com/kb/docs/DOC-52927
Setting PostPowerFailure: https://access.redhat.com/kb/docs/DOC-16658
Activity:
First, install the OS with the Cluster and ClusterStorage yum repositories enabled.
Stop acpid so it does not conflict with the fence device:
#service acpid stop
#chkconfig --del acpid
Install the software packages for Red Hat Cluster Suite:
#yum install \
cman \
lm_sensors \
net-snmp \
net-snmp-libs \
openais \
perl-Net-Telnet \
tog-pegasus \
cluster-cim \
cluster-snmp \
modcluster \
rgmanager \
system-config-cluster \
perl-Crypt-SSLeay \
sg3_utils \
sg3_utils-devel \
sg3_utils-libs
Configure the cluster (I prefer the old, lighter utility):
#system-config-cluster
Name the cluster, then configure the parameters as follows (best practice):
Post_Fail Delay = 10
Post_Join Delay = 30
Specify the Fence Daemon Properties parameters: Post-Join Delay and Post-Fail Delay.
The Post-Join Delay parameter is the number of seconds the fence daemon (fenced) waits before fencing a node after the node joins the fence domain. The Post-Join Delay default value is 3. A typical setting for Post-Join Delay is between 20 and 30 seconds, but can vary according to cluster and network performance.
The Post-Fail Delay parameter is the number of seconds the fence daemon (fenced) waits before fencing a node (a member of the fence domain) after the node has failed. The Post-Fail Delay default value is 0. Its value may be varied to suit cluster and network performance.
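For reference, those values end up in /etc/cluster/cluster.conf as a line roughly like this:
<fence_daemon post_fail_delay="10" post_join_delay="30"/>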
Set up the cluster nodes (it is important to use not the server hostname but the name bound in /etc/hosts to the heartbeat interface):
node1-hb (name bound in /etc/hosts to the heartbeat IP)
node2-hb (name bound in /etc/hosts to the heartbeat IP)
Create the fence devices:
node1-ilo --> user_ilo/pwd_ilo (iLO hostname node1-ilo)
node2-ilo --> user_ilo/pwd_ilo (iLO hostname node2-ilo)
Associate every cluster node with its own fence device:
node1-hb -> fence device node1-ilo
node2-hb -> fence device node2-ilo
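Once saved, the corresponding sections of /etc/cluster/cluster.conf should look roughly like this (the fence_ilo agent is assumed, as for the iLO 2 devices above):
<clusternodes>
  <clusternode name="node1-hb" nodeid="1" votes="1">
    <fence>
      <method name="1">
        <device name="node1-ilo"/>
      </method>
    </fence>
  </clusternode>
  <clusternode name="node2-hb" nodeid="2" votes="1">
    <fence>
      <method name="1">
        <device name="node2-ilo"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>
<fencedevices>
  <fencedevice agent="fence_ilo" name="node1-ilo" ipaddr="node1-ilo" login="user_ilo" passwd="pwd_ilo"/>
  <fencedevice agent="fence_ilo" name="node2-ilo" ipaddr="node2-ilo" login="user_ilo" passwd="pwd_ilo"/>
</fencedevices>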
Test from the command line that the fence system works:
ILO 2 (old HP)
#fence_ilo -a node1-ilo -l user_ilo -o reboot -p pwd_ilo -v
ILO 3 (newer HP G7 servers)
enable lanplus with -P
increase the timeout to 4 seconds with -T
#fence_ipmilan -P -a "ip_address_fence_device" -l cluster -p cluster2011 -o reboot -T 4
(see HP Resource)
Test via the fence daemon that the fence system works:
#fence_node cluster_node_name (e.g. node1-hb)
Creating a FailOverDomain
(this is the group of cluster servers across which the service can switch)
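In cluster.conf this ends up, inside the <rm> section, as something like the following (the domain name fd_apache is a placeholder):
<failoverdomains>
  <failoverdomain name="fd_apache" ordered="0" restricted="1">
    <failoverdomainnode name="node1-hb" priority="1"/>
    <failoverdomainnode name="node2-hb" priority="1"/>
  </failoverdomain>
</failoverdomains>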
OK! Now you can start the cman service and watch the logs to verify that all the nodes create and join the cluster.
#service cman start
(start only the cman service first, to check the cluster infrastructure)
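As a quick sanity check, you can also query the membership directly:
#cman_tool status
#cman_tool nodes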
Creating Resources for the Clustered Service
create an IP resource with "Monitor Link" enabled
create a FileSystem resource with "Force Unmount" enabled
Creating Service
AutoStart = yes
RunExclusive = no
FailoverDomain = XXXXX
RecoveryPolicy = Relocate
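Inside the <rm> section of cluster.conf, the resources and the service defined above come out roughly like this (the IP address, device and mount point are placeholders; fd_apache is the failover domain sketched earlier):
<resources>
  <ip address="10.0.0.100" monitor_link="1"/>
  <fs name="fs_apache" device="/dev/mapper/vg_cluster-lv_apache" mountpoint="/apache_data" fstype="ext3" force_unmount="1"/>
</resources>
<service autostart="1" exclusive="0" domain="fd_apache" name="APACHE" recovery="relocate">
  <ip ref="10.0.0.100"/>
  <fs ref="fs_apache"/>
</service>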
Align the cluster conf on all servers
the first time you must manually align the conf file:
#scp /etc/cluster/cluster.conf cluster_node_2:/etc/cluster/cluster.conf
if the service "cman" is alredy started you can user a command
# ccs_tool update /etc/cluster/cluster.conf
Reload Cluster Config after a manual upgrade with system online
# ccs_tool update /etc/cluster/cluster.conf
Reload Cluster Config after a manual upgrade with system online
# vi /etc/cluster/cluster.conf
modify the first line (increment the config_version attribute, e.g. config_version="11" -> config_version="12")
# ccs_tool update /etc/cluster/cluster.conf
Config file updated from version 11 to 12
Update complete.
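To verify that the new version has propagated, you can run on each node:
#cman_tool version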
Test the functionality of the service configured in the cluster before starting the rgmanager service
testing with the apache service:
# rg_test test /etc/cluster/cluster.conf stop service APACHE
Running in test mode.
Stopping APACHE...
Stop of APACHE complete
# rg_test test /etc/cluster/cluster.conf start service APACHE
Running in test mode.
Starting APACHE...
Start of APACHE complete
If the test is OK, we are ready to start the rgmanager service.
#service rgmanager start
In the directory
# cd /usr/share/cluster
you can find some example service scripts from Red Hat
# mkdir /etc/cluster/script
# cp /usr/share/cluster/oracledb.sh /etc/cluster/script/
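A copied script can then be referenced as a script resource in cluster.conf; a minimal sketch, assuming the oracledb.sh copy above (the resource name is a placeholder):
<script name="oracledb" file="/etc/cluster/script/oracledb.sh"/>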
- test manual relocation of the service
#clusvcadm -d service (disable)
#clusvcadm -e service (enable)
#clusvcadm -r service (relocate)
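While relocating, you can follow the service state from another terminal (the -i option refreshes the view every 2 seconds):
#clustat -i 2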
- test automatic relocation or in-place restart (depending on your conf) of the service when a resource fails
node1#tail -f /var/log/messages
node2#umount /file_system
you should see a restart or relocation of the service in the log
- test manual fencing of a node, with restart/relocation of the service and subsequent re-join of the server to the cluster
#fence_ipmilan -P -a "ip_address_fence_device" -l cluster -p cluster2011 -o reboot -T 4
- test a node-crash, one node in kernel-panic, the other must fence it and relocate the service
node-1# echo 1 > /proc/sys/kernel/sysrq
node-1# echo c > /proc/sysrq-trigger
node-2# tail -f /var/log/messages
- test a network heartbeat crash
# ifdown ethXXX (the interface on the cluster/interlink/heartbeat network)
In the end it's simple......