8- Administering after failure
After a redundant environment is configured with MCCS, failures may still occur.
This chapter explains how MCCS detects a failure and how to administer the system after a failure or failover.
(In the following examples, the operating server is registered in MCCS as 'Active' and the standby server as 'Standby'.)
Server Failure
This is the case where the system reboots or shuts down because of a device conflict (NIC, RAID controller), a kernel driver problem, or a problem caused by another application.
Active Server Failure
Standby Server Failure
When a failure occurs on the standby server, MCCS displays the failure.
Data replication is paused until the standby server returns to normal.
[Figure] Failure in Standby Server
Data synchronization cannot be performed. The mirror disk enters the 'Network Connection Failure' state.
Even if the standby server fails, operation is not affected. However, because there is then no server to fail over to, the server operator must check the problem in the MCCS web console and make sure the standby server is restored promptly.
When the standby server is restored, the icon disappears.
Data synchronization is performed so that the mirror disk's DiskState value changes from 'Inconsistent' to 'UpToDate'.
When the synchronization is finished, current data is again synchronized in real time. The icon is changed to .
Application Failure
Active application resources are governed by the four attributes below.
MonitorInterval (Default Value=10sec)
Monitors the resource at the interval set by this value.
MonitorTimeout (Default Value=10sec)
If there is no reply within the set time, it is considered a failure.
RestartLimit (Default Value=0)
Restarts the application resource up to the set number of times.
OnlineTrustTime (Default Value=600sec)
The time after which the resource's restart count is reset.
The attributes above are set when the resource is registered, and users can check or change the values in the Resource Attribute view of the MCCS console.
[Figure] Resource attribute value Edit
MCCS periodically monitors the resources at the interval set in 'MonitorInterval'.
If there is no response within the time set in 'MonitorTimeout', it is considered a failure.
If the resource still does not respond after the restart command has been sent the number of times set in 'RestartLimit', MCCS fails over the group to which the resource belongs.
If the resource stays in the normal state for the time set in 'OnlineTrustTime', MCCS resets the restart count used for 'RestartLimit'. This ensures that the full number of restarts is available the next time a failure occurs in the resource.
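The interaction of the four attributes can be sketched as a small simulation. This is only an illustrative model, not MCCS code: the class `ResourceAttributes`, the function `simulate`, and the assumption that one monitor probe happens every MonitorInterval seconds are hypothetical. Only the attribute names and default values come from this chapter.

```python
from dataclasses import dataclass


@dataclass
class ResourceAttributes:
    """Hypothetical container; names and defaults mirror the manual."""
    monitor_interval: int = 10    # MonitorInterval: seconds between checks
    monitor_timeout: int = 10     # MonitorTimeout: seconds to wait for a reply
    restart_limit: int = 0        # RestartLimit: restarts allowed before failover
    online_trust_time: int = 600  # OnlineTrustTime: healthy seconds that reset the count


def simulate(attrs, probe_results):
    """Replay monitor results (True = replied within MonitorTimeout).

    Returns one action per probe: "ok", "restart" or "failover".
    Assumption: one probe is made every MonitorInterval seconds.
    """
    actions = []
    restarts_used = 0
    healthy_seconds = 0
    for replied in probe_results:
        if replied:
            healthy_seconds += attrs.monitor_interval
            if healthy_seconds >= attrs.online_trust_time:
                restarts_used = 0  # resource is trusted again
            actions.append("ok")
            continue
        healthy_seconds = 0
        if restarts_used < attrs.restart_limit:
            restarts_used += 1
            actions.append("restart")
        else:
            actions.append("failover")
            break  # the group has moved to the other server
    return actions
```

With the default RestartLimit of 0, `simulate(ResourceAttributes(), [True, False])` yields a failover on the first missed reply, which matches the behavior described above. Raising RestartLimit lets the resource be restarted in place before the group is moved.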
If a failover occurs because of a resource failure, the server operator checks the problem and restores the resource to normal.
In the MCCS web console, a user can see where the trouble occurred. After checking the trouble area, the user must remove the trouble sign so that the failover function is enabled again.
If you want failure signs to be removed automatically, enter a positive number in AutoFaultClearTime of the group attribute.
When the failed server is rebooted, check the mirror role of the two servers, then set the failed server as the replication target and perform a partial resync.
[Figure] Failure in Resource Clear
Network Failure
A network failure occurs when there is a problem with the network connection, such as a broken network switch or network interface card, a disconnected network cable, or a ping timeout on a network.
Because the MCCS license is tied to the MAC address, the license must be reissued if the network interface card is changed.
Service Network Failure
If a failure occurs in the service network of the active server, a fault mark is shown on the network interface card resource or IP address resource of the node in the MCCS UI, and the group fails over to the standby server.
[Figure] Failure in Network Interface Card
In the MCCS web console, you can check in which part of the service network the trouble occurred.
MCCS checks whether the network cable of the server where the network failure occurred is disconnected, and whether a ping timeout occurs on the network.
If the IP address resource is the cause of the failure, the user should check the network switch or network interface card.
When the physical parts related to the network are back to normal, select 'Clear Fault' in the MCCS web console to remove the fault mark and re-enable the failover function.
If you want failure signs to be removed automatically, enter a positive number in AutoFaultClearTime of the group attribute.
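The difference between clearing a fault manually and letting AutoFaultClearTime clear it can be modeled roughly as follows. This is a hypothetical sketch, not the MCCS implementation: the function name `fault_mark_visible` is invented for illustration, and it assumes that the value is expressed in seconds and that zero or a negative value disables automatic clearing. The manual states only that a positive number enables automatic removal.

```python
def fault_mark_visible(seconds_since_fault, auto_fault_clear_time=0, manually_cleared=False):
    """Return True while the fault mark would still be shown.

    Assumptions (not confirmed by the manual): the attribute is in seconds,
    and a value of zero or less means the mark stays until 'Clear Fault'.
    """
    if manually_cleared:
        return False  # operator selected 'Clear Fault'
    if auto_fault_clear_time > 0:
        return seconds_since_fault < auto_fault_clear_time
    return True  # no automatic clearing configured
```

Under these assumptions, a group with the attribute left at 0 keeps its fault mark, and failover stays disabled, until an operator clears it.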
A heartbeat failure can be checked in the MCCS log and the Windows system log. If a failure occurs on the heartbeat line, the server operator should check the server's TCP/IP settings and check the physical heartbeat connection with a ping test.
If the situation is abnormal, check the card and the cable for a loose connection or disconnection, and clear the cause of the failure.
A replication network failure can be checked in the MCCS log and the OS system log. If a failure occurs in the replication network, the server operator should check the server's TCP/IP settings and check the physical replication network connection with a ping test.
If the situation is abnormal, check the card and the cable for a loose connection or disconnection, and clear the cause of the failure.
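The ping test used for the heartbeat and replication networks can be scripted. The sketch below is not part of MCCS; the function names `build_ping_command` and `is_reachable` are hypothetical. It relies only on the standard `ping` utility, which takes `-n` for the packet count on Windows and `-c` on Linux and most other systems.

```python
import platform
import subprocess


def build_ping_command(host, count=4, system=None):
    """Return the ping argv for the given OS ('Windows' uses -n, others use -c)."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def is_reachable(host, count=4):
    """Run ping and report whether it exited with status 0."""
    result = subprocess.run(
        build_ping_command(host, count),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For example, `is_reachable("10.0.0.2")` could be run from the active server, where 10.0.0.2 stands in for the peer's heartbeat or replication address (a placeholder, not a value from this manual). The exit status is only a rough signal, so when the result is unexpected it is worth running ping by hand and reading its output.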
A network switch failure can be checked in the MCCS log and the OS system log. If a failure occurs in the service network connection, the server operator should check the server's TCP/IP settings and check the physical service network connection with a ping test.
If you want failure signs to be removed automatically, enter a positive number in AutoFaultClearTime of the group attribute.
For recovery from a network switch failure, obtain support from the switch manufacturer.