Appendix A. Administrating after failure


After a redundant environment has been configured with MCCS, failures can still occur.
This appendix explains how MCCS detects a failure and how to administer the system after a failure or failover.
(In the following examples, the operating server is registered in MCCS as 'Active' and the standby server as 'Standby'.)

 

 

Server Failure

This is the case where the system reboots or shuts down because of a device conflict (NIC, RAID controller) or a kernel driver problem in another application.

Active Server Failure

  1. MCCS behaves the same way whether the server terminates normally or abnormally: when the operating server fails, MCCS performs a failover to the standby server.
    In the node management menu on the right side of the screen, select the server. You can check the details of the failure in the 'Resource Status' and 'Resource Dependency' screens.
    Because data cannot be replicated while the server is down, a status icon is shown on the mirror disk resource.

    • Normal termination of a system
      The user selected 'system shutdown' in the operating system.

    • Abnormal termination of a system
      The system was terminated or rebooted because of an unexpected situation or a blue screen.


    [Figure] Failure in Active Server

  2. The server operator investigates the failure and restores the server to normal.

  3. When the failed server has rebooted, check the mirror role of both servers, then set the failed server as the replication target and run a partial resync.

Standby Server Failure

  1. When a failure occurs on the standby server, MCCS shows the failure.

  2. Data replication is paused until the standby server is back to normal.


    [Figure] Failure in Standby Server

  3. If I/O continues, the data cannot be replicated and the mirror disk is shown as 'Paused'.

  4. If there is no I/O, the mirror disk icon does not change, but failure messages related to the mirror disk are written to the MCCS log.

  5. A standby server failure does not affect operation. However, because there is no server to fail over to, the server operator must check the problem in the MCCS web console and make sure that the standby server is restored promptly.

  6. When the standby server is back to normal, the mirror disk recovers from 'Paused' to 'Normal' and the icon disappears.
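
The mirror disk display rules in steps 3, 4 and 6 can be summarized as a small decision function. This is only an illustrative sketch: the function name and its parameters are hypothetical and are not part of MCCS.

```python
def mirror_display_state(standby_up: bool, io_since_failure: bool,
                         current: str = "Normal") -> str:
    """Return the mirror disk state shown in the console (hypothetical helper)."""
    if standby_up:
        # Step 6: once the standby server is back, the state returns to 'Normal'.
        return "Normal"
    if io_since_failure:
        # Step 3: writes cannot be replicated, so the disk is shown as 'Paused'.
        return "Paused"
    # Step 4: with no I/O, the displayed state does not change.
    return current
```

For example, a standby failure with ongoing I/O gives 'Paused', while the same failure on an idle system leaves the displayed state as it was.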

Application, Process and Windows Service Failure

Application, process, and Windows service resources are governed by the four attributes below.

  • MonitorInterval
    The interval at which the resource is monitored. (Default Value=10sec)

  • MonitorTimeout
    If there is no reply within this time, the resource is considered failed. (Default Value=10sec)

  • RestartLimit
    The number of times MCCS attempts to restart the resource. (Default Value=0)

  • OnlineTrustTime
    The time after which the restart count of the resource is reset. (Default Value=600sec)

    These attributes are set when the resource is added, and users can check or change their values in the Resource Attribute view of the MCCS console.


    [Figure] Resource attribute value Edit

  1. MCCS monitors the resource periodically, at the interval set in 'MonitorInterval'.

  2. If there is no response within the time set in 'MonitorTimeout', the resource is considered failed.

  3. If the resource still does not respond after the restart command has been sent the number of times set in 'RestartLimit', MCCS fails over the group that the resource belongs to.

  4. If the resource stays in a normal state for the time set in 'OnlineTrustTime', MCCS resets the restart count tracked against 'RestartLimit'. This ensures that the full number of restarts is available the next time the resource fails.

  5. If a failover occurs because of a resource failure, the server operator investigates the problem and restores the resource to normal.

  6. In the MCCS web console, the user can see where the problem occurred. After checking the affected area, the user must clear the fault mark so that the failover function is enabled again.

  7. When the failed server has rebooted, check the mirror role of both servers, then set the failed server as the replication target and run a partial resync.


    [Figure] Failure in Resource Clear
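
The interaction of the four attributes described above can be sketched in a few lines of Python. This is a simplified model, not MCCS code: the class names, the `on_probe` method, and the exact boundary comparisons are assumptions made for illustration. Only the attribute names and default values come from this appendix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourcePolicy:
    monitor_interval: float = 10.0     # MonitorInterval (sec): how often to probe
    monitor_timeout: float = 10.0      # MonitorTimeout (sec): no reply within this = failure
    restart_limit: int = 0             # RestartLimit: restarts allowed before failover
    online_trust_time: float = 600.0   # OnlineTrustTime (sec): healthy time that resets the count


class ResourceMonitor:
    """Hypothetical model of the restart/failover decision for one resource."""

    def __init__(self, policy: ResourcePolicy) -> None:
        self.policy = policy
        self.restarts_used = 0
        self.healthy_since: Optional[float] = None

    def on_probe(self, now: float, response_time: Optional[float]) -> str:
        """Process one probe result; response_time is None when there was no reply."""
        p = self.policy
        failed = response_time is None or response_time > p.monitor_timeout
        if not failed:
            if self.healthy_since is None:
                self.healthy_since = now
            # Step 4: a long enough healthy period restores the full restart budget.
            if now - self.healthy_since >= p.online_trust_time:
                self.restarts_used = 0
            return "ok"
        # Step 2: the resource is considered failed.
        self.healthy_since = None
        if self.restarts_used < p.restart_limit:
            self.restarts_used += 1
            return "restart"
        # Step 3: restart budget exhausted, so the group fails over.
        return "failover"
```

With the default RestartLimit of 0, the first detected failure in this model leads directly to a failover. Setting RestartLimit to a positive value lets the resource be restarted in place that many times first.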

Network Failure

A network failure occurs when there is a problem with the network connection, such as a broken network switch or network interface card, a disconnected network cable, or a ping timeout on a network.

※ Because the MCCS license is tied to the MAC address, the license must be reissued if the network interface card is changed.

  • Service Network Failure

    If a failure occurs in the service network of the active server, a fault mark is shown on the network interface card resource or the IP address resource of the node in the MCCS web console, and MCCS fails over to the standby server.


    [Figure] Failure in Network Interface Card

     

  1. In the MCCS web console, you can check which part of the service network the problem occurred in.

  2. MCCS checks whether the network cable of the affected server is disconnected and whether a ping timeout occurs on the network.

  3. If the IP address resource is the cause of the failure, check the network switch or the network interface card.
    When the physical network components are back to normal, select 'Clear Fault' in the MCCS console to remove the fault mark and re-enable the failover function.

  4. To have fault marks removed automatically, enter a positive number in the AutoFaultClearTime group attribute.
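
The difference between manual and automatic fault clearing can be illustrated with a short sketch. It assumes that AutoFaultClearTime is a number of seconds counted from the moment the fault was marked, and that zero or a negative value means the mark stays until 'Clear Fault' is selected. This appendix states only that a positive number enables automatic clearing, so check the unit and exact timing against the attribute reference. The function name is hypothetical.

```python
def fault_mark_visible(fault_time: float, now: float,
                       auto_fault_clear_time: float) -> bool:
    """Return True while the fault mark is still shown (illustrative model only)."""
    if auto_fault_clear_time <= 0:
        # Assumed: no automatic clearing, the operator must select 'Clear Fault'.
        return True
    # Assumed: the mark is removed once the configured time has elapsed.
    return (now - fault_time) < auto_fault_clear_time
```

Under these assumptions, a group with AutoFaultClearTime set to 300 would show the mark for five minutes after the fault and then become eligible for failover again without operator action.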