Thursday, August 4, 2011

Red Hat Cluster Suite (RHCS)

Do you want to build a cluster? Are you ready to start?

Prerequisites:
- Servers on the Red Hat HCL https://hardware.redhat.com/
- Servers with a certified fence device
   https://access.redhat.com/kb/docs/DOC-30003
   https://access.redhat.com/kb/docs/DOC-30004
- Shared storage

Documentation you must read:
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/index.html
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Suite_Overview/index.html
https://fedorahosted.org/cluster/wiki/HomePage
https://access.redhat.com/kb/docs/DOC-59827
http://www.centos.org/docs/5/pdf/Cluster_Administration.pdf
http://www.syslog.gr/articles-mainmenu-99/26-redhat-cluster-suite.html


Network Definition:
- one "network" configured for LAN (client to server - application)
- one "network" configured for Intracluster/Heartbeat/Interlink (come più vi piace chiamarla)
- one "network" configured for Fence Device
https://access.redhat.com/kb/docs/DOC-53348



Definition of Cluster and Cluster Node:
- File /etc/hosts https://access.redhat.com/kb/docs/DOC-25593
- Naming Server/Cluster Node https://access.redhat.com/kb/docs/DOC-5935
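As a minimal sketch, a two-node /etc/hosts could look like this (the IP addresses are illustrative; the -hb/-ilo names match the naming used later in this post):

# LAN (client to server - application)
192.168.1.11   node1
192.168.1.12   node2
# Intracluster/Heartbeat/Interlink
10.0.0.11      node1-hb
10.0.0.12      node2-hb
# Fence device (iLO)
10.0.1.11      node1-ilo
10.0.1.12      node2-ilo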


Fencing:
- FENCE ILO3 https://access.redhat.com/kb/docs/DOC-56880
- FENCE ACPID https://access.redhat.com/kb/docs/DOC-5415
- FENCE with 2 nodes https://access.redhat.com/kb/docs/DOC-55111
- FENCE Parameters http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/ap-fence-device-param-CA.html

Quorum:
- QuorumDisk https://access.redhat.com/knowledge/techbriefs/how-optimally-configure-quorum-disk-red-hat-enterprise-linux-clustering-and-hig
- https://access.redhat.com/sites/default/files/rhel_cluster_qdisk.pdf
- Qdisk in a 2-node environment https://access.redhat.com/kb/docs/DOC-52938
- LastManStanding https://access.redhat.com/kb/docs/DOC-60258
- The "5 Questions" about Quorum https://access.redhat.com/kb/docs/DOC-52974
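As a minimal sketch of a quorum disk setup (the device and label here are illustrative; see the qdisk links above for proper interval/tko tuning), you initialize a small shared LUN once and then declare it in cluster.conf:

#mkqdisk -c /dev/mapper/mpath1 -l myqdisk

<quorumd interval="1" tko="10" votes="1" label="myqdisk"/>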

3-Node Cluster:
- RHCS wasn't designed for complete fault tolerance out of the box; for example, you must change the default configuration to have a 3-node cluster that keeps services running when 2 nodes have failed - https://access.redhat.com/kb/docs/DOC-23914
- https://access.redhat.com/kb/docs/DOC-49683

Service and Daemon:
https://access.redhat.com/kb/docs/DOC-5899

Best Practice:
https://access.redhat.com/kb/docs/DOC-40821

Testing/Acceptance:
- Test FailOver https://access.redhat.com/kb/docs/DOC-5902
- Test Fencing  https://access.redhat.com/kb/docs/DOC-52927

Setting PostPowerFailure https://access.redhat.com/kb/docs/DOC-16658

Activity:

First, install the OS with the Cluster and ClusterStorage yum repositories enabled

Stop acpid so it does not conflict with the fence device:
#service acpid stop
#chkconfig --del acpid


Install the software packages for Red Hat Cluster Suite:
#yum install \
cman \
lm_sensors \
net-snmp \
net-snmp-libs \
openais \
perl-Net-Telnet \
tog-pegasus \
cluster-cim \
cluster-snmp \
modcluster  \
rgmanager \
system-config-cluster \
perl-Crypt-SSLeay \
sg3_utils \
sg3_utils-devel \
sg3_utils-libs


Configure the cluster (I prefer the old, more lightweight utility):
#system-config-cluster


Naming the cluster.


Configure the parameters as follows (best practice):



Post_Fail Delay = 10
Post_Join Delay = 30
Specify the Fence Daemon Properties parameters: Post-Join Delay and Post-Fail Delay.
The Post-Join Delay parameter is the number of seconds the fence daemon (fenced) waits before fencing a node after the node joins the fence domain. The Post-Join Delay default value is 3. A typical setting for Post-Join Delay is between 20 and 30 seconds, but it can vary according to cluster and network performance.
The Post-Fail Delay parameter is the number of seconds the fence daemon (fenced) waits before fencing a node (a member of the fence domain) after the node has failed. The Post-Fail Delay default value is 0. Its value may be varied to suit cluster and network performance.
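In cluster.conf these best-practice values end up as attributes of the fence_daemon element:

<fence_daemon post_fail_delay="10" post_join_delay="30"/>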
Setting up the cluster nodes (it's important to use not the server's hostname but the name bound in /etc/hosts to the heartbeat interface):

node1-hb (name bound in /etc/hosts to the heartbeat IP)
node2-hb (name bound in /etc/hosts to the heartbeat IP)

Create the fence devices:
node1-ilo --> user_ilo/pwd_ilo - node1-ilo
node2-ilo --> user_ilo/pwd_ilo - node2-ilo
Associate every cluster node with its own fence device:
node1-hb -> fence device node1-ilo
node2-hb -> fence device node2-ilo
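The resulting cluster.conf fragment looks roughly like this (a sketch using the names above; system-config-cluster generates it for you, and with ILO 3 the agent would be fence_ipmilan instead of fence_ilo):

<clusternodes>
  <clusternode name="node1-hb" nodeid="1" votes="1">
    <fence>
      <method name="1">
        <device name="node1-ilo"/>
      </method>
    </fence>
  </clusternode>
  <clusternode name="node2-hb" nodeid="2" votes="1">
    <fence>
      <method name="1">
        <device name="node2-ilo"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>
<fencedevices>
  <fencedevice agent="fence_ilo" name="node1-ilo" hostname="node1-ilo" login="user_ilo" passwd="pwd_ilo"/>
  <fencedevice agent="fence_ilo" name="node2-ilo" hostname="node2-ilo" login="user_ilo" passwd="pwd_ilo"/>
</fencedevices>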
Testing if the fence system is OK from the command line:

ILO 2 (old HP)

#fence_ilo -a node1-ilo -l user_ilo -o reboot -p pwd_ilo -v

ILO 3 (new HP G7)
enable lanplus with -P
increase the timeout to 4 seconds with -T

#fence_ipmilan -P -a "ip_address_fence_device" -l cluster -p cluster2011 -o reboot -T 4
(see HP Resource)

Testing if the fence system is OK through the fence daemon:
#fence_node cluster_node_name (e.g., node1-hb)

Creating a FailOverDomain
(this is the group of cluster servers across which the service can switch)
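As a sketch, the corresponding cluster.conf fragment (the domain name and the ordered/restricted flags are illustrative; their meaning is described in the Cluster Administration guide):

<failoverdomains>
  <failoverdomain name="my_fod" ordered="1" restricted="1">
    <failoverdomainnode name="node1-hb" priority="1"/>
    <failoverdomainnode name="node2-hb" priority="2"/>
  </failoverdomain>
</failoverdomains>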

OK! Now you can start the cman service and check the log to see that all the nodes create and join the cluster.

#service cman start
(start only the cman service to check the cluster infrastructure)
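To verify that the cluster is quorate and every node has joined:

#cman_tool status
#cman_tool nodes
#clustat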

Creating Resources for the Clustered Service:
creating an IP resource with "Monitor Link"
creating a FileSystem resource with "Force Unmount"
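In cluster.conf terms, a sketch of these two resources (values taken from the rg_test output shown below):

<resources>
  <ip address="10.154.4.160/22" monitor_link="1"/>
  <fs name="files_db" device="/dev/mapper/mpath0" mountpoint="/files_db" fstype="ext3" force_unmount="1"/>
</resources>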

Creating the Service:
AutoStart = yes
RunExclusive = no
FailoverDomain = XXXXX
RecoveryPolicy = Relocate
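This maps to a service stanza roughly like the following (the apache resource attributes are illustrative; the domain is whatever FailOverDomain you created above):

<service name="APACHE" autostart="1" exclusive="0" domain="XXXXX" recovery="relocate">
  <ip ref="10.154.4.160/22"/>
  <fs ref="files_db"/>
  <apache name="Apache" server_root="/etc/httpd" config_file="conf/httpd.conf"/>
</service>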

Align the cluster conf on all servers
The first time, you must align the conf file manually:
#scp /etc/cluster/cluster.conf cluster_node_2:/etc/cluster/cluster.conf

If the cman service is already started, you can use the command:
# ccs_tool update /etc/cluster/cluster.conf


Reload the cluster config after a manual update while the system is online

# vi /etc/cluster/cluster.conf
modify the first line, incrementing config_version:
<cluster config_version="12" name="XXXXXXXX">


# ccs_tool update /etc/cluster/cluster.conf
Config file updated from version 11 to 12
Update complete.

Test the functionality of the services configured in the cluster before starting the rgmanager service.
Testing with the apache service:

# rg_test test /etc/cluster/cluster.conf stop service APACHE
Running in test mode.
Stopping APACHE...
 Verifying Configuration Of apache:Apache
 Checking Syntax Of The File /etc/httpd/conf/httpd.conf
 Checking Syntax Of The File  > Succeed
  Stopping Service apache:Apache
  Stopping Service apache:Apache > Succeed
  unmounting /files_db
  Removing IPv4 address 10.154.4.160/22 from bond0
Stop of APACHE complete

# rg_test test /etc/cluster/cluster.conf start service APACHE
Running in test mode.
Starting APACHE...
 Link for bond0: Detected
  Adding IPv4 address 10.154.4.160/22 to bond0
 Sending gratuitous ARP: 10.154.4.160 00:23:7d:56:fb:f8 brd ff:ff:ff:ff:ff:ff
  mounting /dev/mapper/mpath0 on /files_db
 mount -t ext3  /dev/mapper/mpath0 /files_db
 Verifying Configuration Of apache:Apache
 Checking Syntax Of The File /etc/httpd/conf/httpd.conf
 Checking Syntax Of The File  > Succeed
  Starting Service apache:Apache
 Looking For IP Addresses
 1 IP addresses found for APACHE/Apache
 Looking For IP Addresses > Succeed -  IP Addresses Found
 Checking: SHA1 checksum of config file /etc/cluster/apache/apache:Apache/httpd.conf
 Checking: SHA1 checksum > succeed
 Generating New Config File /etc/cluster/apache/apache:Apache/httpd.conf From /etc/httpd/conf/httpd.conf
 Generating New Config File /etc/cluster/apache/apache:Apache/httpd.conf From /etc/httpd/conf/httpd.conf > Succeed
 Starting Service apache:Apache > Succeed
Start of APACHE complete

If the test is OK, we are ready to start the rgmanager service.
#service rgmanager start
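After rgmanager is up, check the service state and its owner node with:

#clustat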

In the directory
# cd /usr/share/cluster
you can find some example service scripts provided by Red Hat

# mkdir /etc/cluster/script
# cp /usr/share/cluster/oracledb.sh /etc/cluster/script/
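A script copied this way can then be used as a script resource in the service (the resource name is illustrative; the path is the one created above):

<script name="oracledb" file="/etc/cluster/script/oracledb.sh"/>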


Testing Infrastructure:

- test manual relocation of the service
#clusvcadm -d service (disable)
#clusvcadm -e service (enable)
#clusvcadm -r service (relocate)

- test automatic relocation or in-place restart of the service (depending on your configuration) when a resource fails
node1#tail -f /var/log/messages
node2#umount /file_system
you should see a restart or relocation of the service in the log

- test manual fencing of a node, with restart/relocation of the service and subsequent re-join of the server to the cluster
#fence_ipmilan -P -a "ip_address_fence_device" -l cluster -p cluster2011 -o reboot -T 4

- test a node crash: put one node in kernel panic; the other node must fence it and relocate the service
node-1# echo 1 > /proc/sys/kernel/sysrq
node-1# echo c > /proc/sysrq-trigger
node-2# tail -f /var/log/messages
- test a network heartbeat crash
# ifdown ethXXX - the interface on the cluster/interlink/heartbeat network

In the end, it's simple...


