After a redundant environment is configured with MCCS, failures may still occur.
This chapter explains how MCCS detects each type of failure and how to administer the system after a failure or failover.
(In the following examples, the operating server is registered in MCCS as 'Active' and the standby server as 'Standby'.)



Server Failure

A server failure is a case where the system reboots or shuts down because of a device conflict (NIC, RAID controller) or a kernel driver problem of another application.

Active Server Failure

  1. MCCS plays the same role whether the server terminates normally or abnormally.
    MCCS performs a failover to the standby server when the operating server fails. To inspect the failure, select the server in the node management menu on the right side of the screen.
    You can check the details of the failure in the 'Resource Status' and 'Resource Dependency' screens.
      • Normal termination of a system
        This is a case where the user selected 'system shutdown' in the operating system.
      • Abnormal termination of a system
        This is a case where the system is terminated or rebooted due to an unexpected situation or a blue screen.

    [Figure] Failure in Active Server

  2. Since data cannot be replicated during the server failure, a failure icon is shown on the mirror disk resource.
  3. The server operator checks the failure and restores the server to normal.
  4. When the failed server has rebooted, check the mirror role of both servers, then set the failed server as the replication target and proceed with a partial resync.

Standby Server Failure

  1. MCCS displays the failure when a failure occurs in the standby server.
  2. Data replication is paused until the standby server is back to normal.


    [Figure] Failure in Standby Server

  3. Data synchronization cannot be achieved. The mirror disk enters the 'Network Connection Failure' state.
  4. A standby server failure does not affect operation. However, because there is no server to fail over to, the server operator must check the problem in the MCCS web console and make sure the standby server is restored promptly.
  5. When the standby server is restored, the failure icon disappears.
  6. The mirror disk's DiskState value changes from 'Inconsistent' to 'UpToDate'; to get there, data synchronization is performed.
  7. When the synchronization is finished, data is again replicated in real time and the mirror disk icon returns to its normal state.

Application Failure

Active application resources are governed by the four attributes below.

...

  1. MCCS periodically monitors the resource at the interval set in 'MonitorInterval'.
  2. If there is no response within the time set in 'MonitorTimeout', the resource is considered to have failed.
  3. If there is still no response after the restart command has been sent the number of times set in 'RestartLimit', MCCS fails over the group that the resource belongs to.
  4. If the resource stays in a normal state for the time set in 'OnlineTrustTime', MCCS resets the restart count kept against 'RestartLimit'. This guarantees the full number of restart attempts the next time the resource fails.
  5. If a failover occurs because of a resource failure, the server operator checks the problem and restores the resource to normal.
  6. In the MCCS web console, a user can see where the failure occurred. After checking the failed area, the user must clear the fault mark so that the failover function is enabled again.
    To have the fault mark cleared automatically, enter a positive number in AutoFaultClearTime of the group attribute.
  7. When the failed server has rebooted, check the mirror role of both servers, then set the failed server as the replication target and proceed with a partial resync.


    [Figure] Failure in Resource Clear
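
The interaction of the four monitoring attributes can be sketched as a small simulation. This is only an illustrative model of the steps listed above, not MCCS code: the `ResourcePolicy` class and the `simulate` function are hypothetical names, and the assumption that healthy time accumulates one 'MonitorInterval' per successful check is ours.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ResourcePolicy:
    """Hypothetical container for the four attributes named in the text."""
    monitor_interval: float   # seconds between monitor checks ('MonitorInterval')
    monitor_timeout: float    # seconds to wait for a response ('MonitorTimeout')
    restart_limit: int        # restarts allowed before failover ('RestartLimit')
    online_trust_time: float  # healthy seconds that reset the count ('OnlineTrustTime')


def simulate(policy: ResourcePolicy, response_times: Sequence[Optional[float]]) -> str:
    """Walk through monitor results and return 'online' or 'failover'.

    Each entry is the response time of one monitor check in seconds,
    or None when the resource never answered.
    """
    restarts_used = 0
    healthy_for = 0.0
    for response in response_times:
        failed = response is None or response > policy.monitor_timeout
        if not failed:
            # Assumption: each healthy check adds one monitor interval.
            healthy_for += policy.monitor_interval
            if healthy_for >= policy.online_trust_time:
                restarts_used = 0  # resource is trusted again
            continue
        healthy_for = 0.0
        if restarts_used >= policy.restart_limit:
            return "failover"  # restart attempts are exhausted
        restarts_used += 1     # try restarting the resource in place
    return "online"
```

With a restart limit of 2, three consecutive missed checks lead to a failover, while a long enough healthy period in between resets the count and keeps the resource online.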

Network Failure

A network failure happens when a network connection has a problem, for example a broken network switch or network interface card, a disconnected network cable, or a ping timeout on a network.

Warning

Since the MCCS license is tied to the MAC address, the license must be reissued if the network interface card is changed.

    • Service Network Failure

      If a failure occurs in the service network of the active server, a fault mark is shown on the network interface card resource or the IP address resource of the node in the MCCS UI, and MCCS fails over to the standby server.

      [Figure] Failure in Network Interface Card

...

  1. In the MCCS web console, you can check which part of the service network has failed.
  2. MCCS checks whether the network cable of the failed server is disconnected and whether a ping timeout occurs on the network.
  3. If the IP address resource is the cause of the failure, the user should check the network switch or the network interface card.
    When the physical network components are back to normal, select 'Clear Fault' in the MCCS web console to remove the fault mark and re-enable the failover function.
  4. To have the fault mark cleared automatically, enter a positive number in AutoFaultClearTime of the group attribute.
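
The effect of AutoFaultClearTime can be pictured with the minimal sketch below. The function name is hypothetical, and two points are assumptions that the text does not state: that the value is measured in seconds since the fault was raised, and that zero or a negative value leaves automatic clearing disabled.

```python
def should_auto_clear(fault_age_seconds: float, auto_fault_clear_time: float) -> bool:
    """Return True when a fault mark would be cleared automatically.

    Assumption: a value of zero or less disables automatic clearing, and
    the value is interpreted as seconds since the fault was raised.
    """
    if auto_fault_clear_time <= 0:
        return False  # operator must select 'Clear Fault' manually
    return fault_age_seconds >= auto_fault_clear_time
```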

    • Heartbeat Network Fault

      The heartbeat network should be redundant because it plays a critical role in synchronizing status between nodes and in determining whether a failure has occurred. If one of the redundant heartbeat networks fails, the details of the failure are displayed in the log window.
      However, nothing changes in the MCCS web console, because neither the operating server nor the standby server has a problem.
      If the active server then fails and a failover to the standby server is needed, MCCS uses the remaining healthy heartbeat network to fail over.

      If all redundant heartbeat networks are disconnected, MCCS uses the service network as the heartbeat line.

      [Figure] Failure in Heartbeat

  1. A heartbeat failure can be checked in the MCCS log and the Windows system log. If a failure occurs in a heartbeat line, the server operator should check the TCP/IP settings of the server and verify the physical heartbeat connection with a ping test.
  2. If the result is abnormal, check the network card and the cable for a bad connection or disconnection, and clear the cause of the failure.
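
The fallback order described for the heartbeat lines (any healthy heartbeat network first, then the service network) can be sketched as follows. The function and the link names are hypothetical and only model the behavior stated above; they are not part of MCCS.

```python
from typing import Dict, Optional


def choose_heartbeat_path(heartbeat_links: Dict[str, bool],
                          service_network_up: bool) -> Optional[str]:
    """Pick the line used for heartbeat traffic.

    heartbeat_links maps a link name to True when that link is healthy.
    Returns None when no line is available at all.
    """
    for name, is_up in heartbeat_links.items():
        if is_up:
            return name  # a dedicated heartbeat line is still healthy
    # All dedicated heartbeat lines are down: fall back to the service network.
    return "service-network" if service_network_up else None
```

For example, with one of two heartbeat links down the remaining link is chosen, and the service network is used only when both are down.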


    • Replication (Mirroring) Network Failure

      When the replication network fails, data cannot be replicated. The mirror disk resource in the MCCS web console displays the 'Disconnect' state.

      [Figure] Failure in Replication Network


  1. A replication network failure can be checked in the MCCS log and the OS system log. If a failure occurs in the replication network, the server operator should check the TCP/IP settings of the server and verify the physical replication network connection with a ping test.
  2. If the result is abnormal, check the network card and the cable for a bad connection or disconnection, and clear the cause of the failure.
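
The ping test mentioned in these steps can be scripted. The sketch below is one possible way to do it and is not an MCCS tool: the helper names and the example address are hypothetical. It relies only on the standard `ping` count flag, which is `-n` on Windows and `-c` on Linux.

```python
import platform
import subprocess
from typing import List


def build_ping_command(host: str, count: int = 3, system: str = "") -> List[str]:
    """Build a ping command line for the given (or current) operating system."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def is_reachable(host: str) -> bool:
    """Run ping and report whether it exited successfully.

    The exit code is only a rough signal; read the ping output as well
    when the result is unexpected.
    """
    result = subprocess.run(
        build_ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling `is_reachable("192.168.10.2")` from the active server, where the address stands in for the peer's replication or heartbeat interface, gives a quick first check before inspecting cards and cables.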


    • Single Network Switch Fault

      When a failure occurs in the network switch connected to the public network and the network is configured with a single switch, all resources on the active and standby servers are taken offline, and the resources where the failure occurred are shown as 'fault'.

      [Figure] Failure in Network Switch    

  1. A network switch failure can be checked in the MCCS log and the OS system log. If a failure occurs in the service network connection, the server operator should check the TCP/IP settings of the server and verify the physical service network connection with a ping test.
  2. To have the fault mark cleared automatically, enter a positive number in AutoFaultClearTime of the group attribute.
  3. For recovery from a network switch failure, obtain support from the manufacturer.



Disk Failure

Mirror Disk Failure

  • Source Disk Failure

    If a failure occurs in the disk resource of the active server, the MCCS GUI shows the failure. MCCS fails over to the standby server because the disk can no longer be read or written.

     

    [Figure] Failure in Mirror Disk


...

  • Target Disk Failure
    If the disk on the standby server fails, the disk resource icon in the MCCS web console does not change, but the DiskState attribute value changes from 'UpToDate' to 'Diskless'. The service running on the source server is not affected.

    [Figure] Failure in Target Disk

  1. When MCCS detects a failure of the target disk, the failure is shown only through the DiskState value of the disk.
  2. A disk failure can be caused by the following. After these issues are resolved, the OS detects the replaced disk again, and DRBD then proceeds with synchronization.
    • Disk controller problems or other H/W problems, which should be fixed by the manufacturer.
    • Physical disk problems or other H/W problems, which should be fixed by the manufacturer.
  3. If the mirror disk does not synchronize, delete the mirror disk resource and create it again. When you delete the resource, you must also delete the mirror that was created and create it again.
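
Because a target disk failure is visible only through DiskState, it can help to map the values mentioned in this chapter to what they imply. The lookup below is a hypothetical aid, not an MCCS interface; the notes paraphrase this chapter and cover only the three values it names.

```python
# DiskState values named in this chapter and what they suggest to the operator.
DISK_STATE_NOTES = {
    "UpToDate": "mirror is synchronized; no action needed",
    "Inconsistent": "synchronization is pending or in progress",
    "Diskless": "no usable local disk; check the disk or controller",
}


def describe_disk_state(state: str) -> str:
    """Return a short operator note for a DiskState value."""
    return DISK_STATE_NOTES.get(state, "unknown state; check the MCCS log")
```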

  • Split Brain of Mirror Disk Resource 

    This case is rare. The mirror disk on both servers is recognized as primary, and the data shown in the web console does not match.
    It arises when the existing source is not switched to the target at a time when the source and target roles need to be swapped. Each node then tries to synchronize from its own data, but automatic synchronization fails because the earlier data does not match.

    In the mirror disk, a split brain can occur as follows.
  1. A failover occurs because of a failure of the source server (A).
  2. The role of the target server (B) is changed to Primary. (The mirror disk role is changed.)
  3. The initial source server (A) is rebooted.
  4. After the initial source server (A) boots, the role of the target server (B) is checked.
  5. The GI values of both nodes are checked.
  6. If the GI data matches, data synchronization proceeds automatically. (In a split brain, the checks in steps 5 and 6 fail.)
  7. The GI data does not match, so synchronization is required on one node, but no automatic synchronization takes place. (A split brain has occurred.)

    When this state is reached, the mirror disk resource icon is displayed overlapped in the MCCS web console, and the 'SplitBrainStatus' attribute value is set to true on both nodes.
    In this case, you need to change the mirror disk role manually in the MCCS web console and then resynchronize.

...

  1. Check the resource attribute view. 

    [Figure] Verify SplitBrain of MirrorDisk


  2. Check the mirror management view.


    [Figure] Checking Mirror Disk Split Brains

    Warning

    1) The ConnectState of both nodes is StandAlone and the SplitBrainStatus values are set to True.
    2) Check LastMirrorOnlineTime on the mirror disk. (LastMirrorOnlineTime is taken from the system time, so it is not an absolute indicator of which node holds the latest data.)

    3) When a split brain occurs, the following log is displayed.
    (DRBD volume (r0) has a split brain.)
    4) In the mirror management window, the mirror condition is set to 'SPLIT'.


  3. Right-click the mirror disk and click 'Resolve Split Brains'.

    [Figure] Split Brain Resolving Selected


  4. A window explaining split brain is displayed.
     
    [Figure] Checking the Source Node Selection

  5. Select the source node.

    [Figure] Source Role Node Selection

  6. Recheck the selected source node.

    [Figure] Rechecking the Source Node Selection


  7. The split brain is being resolved.

    [Figure] Split Brain Resolved


  8. The split brain resolution is finished.

    [Figure] Resolving Split Brain Finished


  9. The selected node becomes the source node, and the DiskState of the mirror disk changes to 'UpToDate'.

    [Figure] Split Brain Resolve


    Warning

    All changes made on node B will be overwritten.
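
The split-brain indicators listed in step 2 (ConnectState of StandAlone and SplitBrainStatus of True on both nodes) can be expressed as a simple check. This is a sketch under the assumption that the attribute values have already been read from the web console or another source; the `MirrorNodeStatus` class and `is_split_brain` function are hypothetical names, not an MCCS API.

```python
from dataclasses import dataclass


@dataclass
class MirrorNodeStatus:
    """Hypothetical snapshot of one node's mirror disk attributes."""
    node: str
    connect_state: str        # value of 'ConnectState', e.g. 'StandAlone'
    split_brain_status: bool  # value of 'SplitBrainStatus'


def is_split_brain(first: MirrorNodeStatus, second: MirrorNodeStatus) -> bool:
    """True when both nodes are StandAlone and both flag a split brain."""
    return all(
        status.connect_state == "StandAlone" and status.split_brain_status
        for status in (first, second)
    )
```

LastMirrorOnlineTime is deliberately left out of the check, since the text notes it is based on system time and cannot decide by itself which node holds the latest data.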


External Storage Failure

When the external disk fails or its connection path is faulty, the disk cannot be read or written, so MCCS displays the fault mark and proceeds with a failover.

...

  1. An external storage failure can be checked in the MCCS log and the system log.

  2. If there is a problem in the external storage, the service is stopped until the storage recovers. The storage should therefore be recovered quickly or replaced with another one (backup storage).

  3. Problems related to the external storage should be handled with the vendor.

  4. When the external storage connection of the server and the failed disk are back to normal, the server should be rebooted so that the MCCS kernel driver can identify the recovered environment.

  5. Redundancy measures should also be arranged with the storage vendor.

SCSI Lock Failure

When interlocked with a volume manager using SCSI3-PR

...

The reservation is canceled when a SCSI Lock agent is deleted. Before deleting it, consider that the shared disk that was reserved becomes usable by the other node. In other words, make sure the other node is down before you delete the agent.



Ways to collect support files

When a problem occurs in MCCS, support files must be collected to gather log and configuration information.
There are two ways to collect support files.

How to collect by using the web console

  1. In the MCCS web console, click 'File' on the menu bar to collect support files.

    [Figure] Collecting Support Files from Menu Bar  

  2. Support files can also be collected by clicking the toolbar button shown in the figure below.

    [Figure] Collecting Support Files from Toolbar

  3. You can select the node to collect support files from, or retrieve a previously collected support file.

    [Figure] Support File Node Selection and Previous Support File Selection

  4. Click the 'OK' button to collect the support files.

    [Figure] Support Files Being Collected


    Info

    It may take several minutes depending on the log file capacity and the network condition.


  5. The collected support files can be checked in the designated location.

    [Figure] Support Files


How to collect by using script files


The script file is located as follows:

...