Thursday, August 4, 2011

Red Hat Cluster Suite (RHCS)

Do you want to build a cluster? Are you ready to start?

Prerequisites:
- Servers on the Red Hat HCL https://hardware.redhat.com/
- Servers with a certified fence device
   https://access.redhat.com/kb/docs/DOC-30003
   https://access.redhat.com/kb/docs/DOC-30004
- Shared storage

Documentation you must read:
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/index.html
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Suite_Overview/index.html
https://fedorahosted.org/cluster/wiki/HomePage
https://access.redhat.com/kb/docs/DOC-59827
http://www.centos.org/docs/5/pdf/Cluster_Administration.pdf
http://www.syslog.gr/articles-mainmenu-99/26-redhat-cluster-suite.html


Network Definition:
- one "network" configured for LAN (client to server - application)
- one "network" configured for Intracluster/Heartbeat/Interlink (come più vi piace chiamarla)
- one "network" configured for Fence Device
https://access.redhat.com/kb/docs/DOC-53348



Definition of Cluster and Cluster Node:
- File /etc/hosts https://access.redhat.com/kb/docs/DOC-25593
- Naming Server/Cluster Node https://access.redhat.com/kb/docs/DOC-5935
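As a minimal sketch, a two-node /etc/hosts could look like this (the IP addresses are illustrative; the -hb/-ilo names match the naming used later in this post):

# LAN (client to server - application)
192.168.1.11   node1
192.168.1.12   node2
# Intracluster/Heartbeat/Interlink
10.0.0.11      node1-hb
10.0.0.12      node2-hb
# Fence device (iLO)
10.0.1.11      node1-ilo
10.0.1.12      node2-ilo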


Fencing:
- FENCE ILO3 https://access.redhat.com/kb/docs/DOC-56880
- FENCE ACPID https://access.redhat.com/kb/docs/DOC-5415
- FENCE with 2 nodes https://access.redhat.com/kb/docs/DOC-55111
- FENCE Parameters http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/ap-fence-device-param-CA.html

Quorum:
- QuorumDisk https://access.redhat.com/knowledge/techbriefs/how-optimally-configure-quorum-disk-red-hat-enterprise-linux-clustering-and-hig
- https://access.redhat.com/sites/default/files/rhel_cluster_qdisk.pdf
- Qdisk in a 2-node environment https://access.redhat.com/kb/docs/DOC-52938
- LastManStanding https://access.redhat.com/kb/docs/DOC-60258
- The "5 Questions" about Quorum https://access.redhat.com/kb/docs/DOC-52974
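As a minimal sketch of a quorum disk setup (the device and label here are illustrative; see the qdisk links above for proper interval/tko tuning), you initialize a small shared LUN once and then declare it in cluster.conf:

#mkqdisk -c /dev/mapper/mpath1 -l myqdisk

<quorumd interval="1" tko="10" votes="1" label="myqdisk"/>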

3-Node Cluster:
- RHCS wasn't designed for complete fault tolerance out of the box; for example, you must change the default configuration to have a 3-node cluster that keeps services running when 2 nodes have failed - https://access.redhat.com/kb/docs/DOC-23914
- https://access.redhat.com/kb/docs/DOC-49683

Service and Daemon:
https://access.redhat.com/kb/docs/DOC-5899

Best Practice:
https://access.redhat.com/kb/docs/DOC-40821

Testing/Acceptance:
- Test FailOver https://access.redhat.com/kb/docs/DOC-5902
- Test Fencing  https://access.redhat.com/kb/docs/DOC-52927

Setting PostPowerFailure https://access.redhat.com/kb/docs/DOC-16658

Activity:

First, install the OS with the Cluster and ClusterStorage yum repositories enabled

Stop acpid so it does not conflict with the fence device:
#service acpid stop
#chkconfig --del acpid


Install the software packages for Red Hat Cluster Suite:
#yum install \
cman \
lm_sensors \
net-snmp \
net-snmp-libs \
openais \
perl-Net-Telnet \
tog-pegasus \
cluster-cim \
cluster-snmp \
modcluster  \
rgmanager \
system-config-cluster \
perl-Crypt-SSLeay \
sg3_utils \
sg3_utils-devel \
sg3_utils-libs


Configure the cluster (I prefer the old, more lightweight utility):
#system-config-cluster


Naming the cluster.


Configure the parameters as follows (best practice):



Post_Fail Delay = 10
Post_Join Delay = 30
Specify the Fence Daemon Properties parameters: Post-Join Delay and Post-Fail Delay.
The Post-Join Delay parameter is the number of seconds the fence daemon (fenced) waits before fencing a node after the node joins the fence domain. The Post-Join Delay default value is 3. A typical setting for Post-Join Delay is between 20 and 30 seconds, but it can vary according to cluster and network performance.
The Post-Fail Delay parameter is the number of seconds the fence daemon (fenced) waits before fencing a node (a member of the fence domain) after the node has failed. The Post-Fail Delay default value is 0. Its value may be varied to suit cluster and network performance.
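In cluster.conf these best-practice values end up as attributes of the fence_daemon element:

<fence_daemon post_fail_delay="10" post_join_delay="30"/>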
Setting up the cluster nodes (it's important to use not the server's hostname but the name bound in /etc/hosts to the heartbeat interface):

node1-hb (name bound in /etc/hosts to the heartbeat IP)
node2-hb (name bound in /etc/hosts to the heartbeat IP)

Create the fence devices:
node1-ilo --> user_ilo/pwd_ilo - node1-ilo
node2-ilo --> user_ilo/pwd_ilo - node2-ilo
Associate every cluster node with its own fence device:
node1-hb -> fence device node1-ilo
node2-hb -> fence device node2-ilo
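The resulting cluster.conf fragment looks roughly like this (a sketch using the names above; system-config-cluster generates it for you, and with ILO 3 the agent would be fence_ipmilan instead of fence_ilo):

<clusternodes>
  <clusternode name="node1-hb" nodeid="1" votes="1">
    <fence>
      <method name="1">
        <device name="node1-ilo"/>
      </method>
    </fence>
  </clusternode>
  <clusternode name="node2-hb" nodeid="2" votes="1">
    <fence>
      <method name="1">
        <device name="node2-ilo"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>
<fencedevices>
  <fencedevice agent="fence_ilo" name="node1-ilo" hostname="node1-ilo" login="user_ilo" passwd="pwd_ilo"/>
  <fencedevice agent="fence_ilo" name="node2-ilo" hostname="node2-ilo" login="user_ilo" passwd="pwd_ilo"/>
</fencedevices>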
Testing if the fence system is OK from the command line:

ILO 2 (old HP)

#fence_ilo -a node1-ilo -l user_ilo -o reboot -p pwd_ilo -v

ILO 3 (new HP G7)
enable lanplus with -P
increase the timeout to 4 seconds with -T

#fence_ipmilan -P -a "ip_address_fence_device" -l cluster -p cluster2011 -o reboot -T 4
(see HP Resource)

Testing if the fence system is OK through the fence daemon:
#fence_node cluster_node_name (e.g., node1-hb)

Creating a FailOverDomain
(this is the group of cluster servers across which the service can switch)
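As a sketch, the corresponding cluster.conf fragment (the domain name and the ordered/restricted flags are illustrative; their meaning is described in the Cluster Administration guide):

<failoverdomains>
  <failoverdomain name="my_fod" ordered="1" restricted="1">
    <failoverdomainnode name="node1-hb" priority="1"/>
    <failoverdomainnode name="node2-hb" priority="2"/>
  </failoverdomain>
</failoverdomains>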

OK! Now you can start the cman service and check the log to see that all the nodes create and join the cluster.

#service cman start
(start only the cman service to check the cluster infrastructure)
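To verify that the cluster is quorate and every node has joined:

#cman_tool status
#cman_tool nodes
#clustat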

Creating Resources for the Clustered Service:
creating an IP resource with "Monitor Link"
creating a FileSystem resource with "Force Unmount"
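In cluster.conf terms, a sketch of these two resources (values taken from the rg_test output shown below):

<resources>
  <ip address="10.154.4.160/22" monitor_link="1"/>
  <fs name="files_db" device="/dev/mapper/mpath0" mountpoint="/files_db" fstype="ext3" force_unmount="1"/>
</resources>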

Creating the Service:
AutoStart = yes
RunExclusive = no
FailoverDomain = XXXXX
RecoveryPolicy = Relocate
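This maps to a service stanza roughly like the following (the apache resource attributes are illustrative; the domain is whatever FailOverDomain you created above):

<service name="APACHE" autostart="1" exclusive="0" domain="XXXXX" recovery="relocate">
  <ip ref="10.154.4.160/22"/>
  <fs ref="files_db"/>
  <apache name="Apache" server_root="/etc/httpd" config_file="conf/httpd.conf"/>
</service>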

Align the cluster conf on all servers
The first time, you must align the conf file manually:
#scp /etc/cluster/cluster.conf cluster_node_2:/etc/cluster/cluster.conf

If the cman service is already started, you can use the command:
# ccs_tool update /etc/cluster/cluster.conf


Reload the cluster config after a manual update while the system is online

# vi /etc/cluster/cluster.conf
modify the first line, incrementing config_version:
<cluster config_version="12" name="XXXXXXXX">


# ccs_tool update /etc/cluster/cluster.conf
Config file updated from version 11 to 12
Update complete.

Test the functionality of the services configured in the cluster before starting the rgmanager service.
Testing with the apache service:

# rg_test test /etc/cluster/cluster.conf stop service APACHE
Running in test mode.
Stopping APACHE...
 Verifying Configuration Of apache:Apache
 Checking Syntax Of The File /etc/httpd/conf/httpd.conf
 Checking Syntax Of The File  > Succeed
  Stopping Service apache:Apache
  Stopping Service apache:Apache > Succeed
  unmounting /files_db
  Removing IPv4 address 10.154.4.160/22 from bond0
Stop of APACHE complete

# rg_test test /etc/cluster/cluster.conf start service APACHE
Running in test mode.
Starting APACHE...
 Link for bond0: Detected
  Adding IPv4 address 10.154.4.160/22 to bond0
 Sending gratuitous ARP: 10.154.4.160 00:23:7d:56:fb:f8 brd ff:ff:ff:ff:ff:ff
  mounting /dev/mapper/mpath0 on /files_db
 mount -t ext3  /dev/mapper/mpath0 /files_db
 Verifying Configuration Of apache:Apache
 Checking Syntax Of The File /etc/httpd/conf/httpd.conf
 Checking Syntax Of The File  > Succeed
  Starting Service apache:Apache
 Looking For IP Addresses
 1 IP addresses found for APACHE/Apache
 Looking For IP Addresses > Succeed -  IP Addresses Found
 Checking: SHA1 checksum of config file /etc/cluster/apache/apache:Apache/httpd.conf
 Checking: SHA1 checksum > succeed
 Generating New Config File /etc/cluster/apache/apache:Apache/httpd.conf From /etc/httpd/conf/httpd.conf
 Generating New Config File /etc/cluster/apache/apache:Apache/httpd.conf From /etc/httpd/conf/httpd.conf > Succeed
 Starting Service apache:Apache > Succeed
Start of APACHE complete

If the test is OK, we are ready to start the rgmanager service.
#service rgmanager start
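After rgmanager is up, check the service state and its owner node with:

#clustat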

In the directory
# cd /usr/share/cluster
you can find some example service scripts provided by Red Hat

# mkdir /etc/cluster/script
# cp /usr/share/cluster/oracledb.sh /etc/cluster/script/
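A script copied this way can then be used as a script resource in the service (the resource name is illustrative; the path is the one created above):

<script name="oracledb" file="/etc/cluster/script/oracledb.sh"/>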


Testing Infrastructure:

- test manual relocation of the service
#clusvcadm -d service (disable)
#clusvcadm -e service (enable)
#clusvcadm -r service (relocate)

- test automatic relocation or in-place restart of the service (depending on your configuration) when a resource fails
node1#tail -f /var/log/messages
node2#umount /file_system
you should see a restart or relocation of the service in the log

- test manual fencing of a node, with restart/relocation of the service and subsequent re-join of the server to the cluster
#fence_ipmilan -P -a "ip_address_fence_device" -l cluster -p cluster2011 -o reboot -T 4

- test a node crash: put one node in kernel panic; the other node must fence it and relocate the service
node-1# echo 1 > /proc/sys/kernel/sysrq
node-1# echo c > /proc/sysrq-trigger
node-2# tail -f /var/log/messages
- test a network heartbeat crash
# ifdown ethXXX - the interface on the cluster/interlink/heartbeat network

In the end, it's simple...


