
After a redundant environment is configured with MCCS, failures may still occur.
This chapter explains how MCCS detects failures and how to administer the system after a failure or failover.
(In the following examples, the operating server is registered in MCCS as 'Active' and the standby server as 'Standby'.)


How to Use EMS (Emergency Message Service)

MCCS includes a bundled product called EMS (Emergency Message Service) that automatically sends SMS messages about critical events to the designated administrators.
In addition, because the console is web-based, errors and faults can be managed from anywhere with internet access. Past failure records, management, and reporting are also easy to use.

EMS Components

EMS Agent

It is a program installed on the server to connect to the EMS server.

EMS Server

It is a server program installed at the company that provides MCCS.

EMS Workflow

Save Log

The EMS Agent saves logs.
On the EMS server, the type of log can be specified with the 'LogType' attribute, as shown below.

  • H: Saves logs related to HA (MCCS). (Only the file monitor can be specified.)
  • A: Saves logs related to applications. (Only the file monitor can be specified.)
  • S: Saves the Windows system event log. (Only the Windows event monitor can be specified.)
  • P: Saves logs related to processes. (Only a specified file can be monitored.)
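
The LogType codes above can be summarized as a small lookup. This is only an illustrative sketch: the enum, the monitor names 'file' and 'windows_event', and the function monitor_for are hypothetical and are not part of the EMS product or its configuration format.

```python
from enum import Enum


class LogType(Enum):
    """The four LogType codes described in this section."""
    HA = "H"           # logs related to HA (MCCS)
    APPLICATION = "A"  # logs related to applications
    SYSTEM = "S"       # Windows system event log
    PROCESS = "P"      # logs related to processes


# Monitor kind that each log type accepts (names are assumptions).
ALLOWED_MONITOR = {
    LogType.HA: "file",
    LogType.APPLICATION: "file",
    LogType.SYSTEM: "windows_event",
    LogType.PROCESS: "file",
}


def monitor_for(code: str) -> str:
    """Return the monitor kind for a LogType code; raises ValueError if unknown."""
    return ALLOWED_MONITOR[LogType(code)]
```

For example, monitor_for("S") returns "windows_event", while an unknown code such as "X" raises ValueError.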


Log Analysis

EMS Server users can set the failure level for each system that is to receive the EMS service.
The EMS server uses this failure level to filter the logs sent by the EMS Agent on the operating server, and analyzes them to determine whether a failure has occurred.
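
The filtering idea can be sketched as follows. The level names, their ordering, and the LogEntry fields are assumptions made for illustration; the actual failure levels and log format of EMS are not described in this chapter.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical severity ordering; EMS may define its levels differently.
LEVELS = {"INFO": 0, "WARNING": 1, "ERROR": 2, "CRITICAL": 3}


@dataclass
class LogEntry:
    log_type: str  # one of the LogType codes: H, A, S, P
    level: str     # one of the keys of LEVELS
    message: str


def select_failures(entries: List[LogEntry], min_level: str) -> List[LogEntry]:
    """Keep only the entries at or above the configured failure level."""
    threshold = LEVELS[min_level]
    return [e for e in entries if LEVELS[e.level] >= threshold]
```

With the level set to "ERROR", an entry logged as "WARNING" is dropped and an entry logged as "CRITICAL" is kept as a failure candidate.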


SMS Notification

Once a failure is confirmed against the configured filter, EMS sends an SMS to the system operator and the MCCS server operator so that the failure can be dealt with quickly.


Connect to the EMS Server and Analyze the Cause of Failure

The system operator and the MCCS service operator can access the EMS server from anywhere with an internet connection to check the logs and analyze the cause.
In addition, for manufacturing customers, it provides a centralized monitoring system for all servers in the factory, along with periodic statistics on failure types and troubleshooting solutions.
The following figure shows the workflow of the EMS system.

[Figure] Workflow of EMS System


Consolidated Web-based Dashboard of EMS Server

The following is part of the consolidated web-based dashboard of the EMS Server.
Servers with failures are shown in red, servers whose failures have been reported to the server operators are shown in yellow, and servers operating normally are shown in blue.
Only users registered on the EMS server can monitor the dashboard.


[Figure] Redundant server monitoring view of EMS system



[Figure] Statistic view of EMS system 


...

    • Replication (Mirroring) Network Failure

      When the replication network fails, data cannot be replicated. The mirror disk resource in the MCCS web console displays 'Disconnect'.

      [Figure] Failure in Replication Network


...

    • Single Network Switch Fault

      When a failure occurs in the network switch connected to the public network, and that network is configured with a single switch, all resources on the active and standby servers are taken offline, and the resources where the failure occurred are shown as 'fault'.

      [Figure] Failure in Network Switch    

  1. A network switch failure can be checked in the MCCS log and the OS system log. If a failure occurs in the service network connection, the server operator should check the server's TCP/IP settings and verify the physical connection of the service network with a ping test.
  2. If you want the fault indication to be cleared automatically, enter a positive number in the AutoFaultClearTime group attribute.
  3. For recovery from a network switch failure, obtain support from the manufacturer.
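
The ping test in step 1 can be scripted, for instance as in the sketch below. The function names are hypothetical and the host address in the usage note is a documentation placeholder. The count flag is '-n' on Windows and '-c' on most other systems. The exit code is only a coarse signal, so the command output should still be read when diagnosing a switch failure.

```python
import platform
import subprocess
from typing import List, Optional


def ping_command(host: str, count: int = 3, system: Optional[str] = None) -> List[str]:
    """Build a ping command line for the current (or given) operating system."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def is_reachable(host: str, count: int = 3) -> bool:
    """Run ping and report whether it exited successfully."""
    result = subprocess.run(
        ping_command(host, count),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For example, is_reachable("192.0.2.10") could be called for the gateway or the peer server's service address, using the real address in place of the placeholder.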



...

  • Target Disk Failure
    If the disk on the standby server fails, the disk resource icon in the MCCS web console does not change, but the DiskState attribute value changes from UpToDate to Diskless. The service running on the source server is not affected.

    [Figure] Failure in Target Disk

  1. When MCCS detects a failure of the target disk, the failure is reflected only in the DiskState value of the disk.
  2. Disk failure can be caused by the following:
    • Disk controller problems or H/W problems, which should be fixed by the manufacturer.
    • Physical disk problems or H/W problems, which should be fixed by the manufacturer.
     After these issues are resolved, the OS detects the replaced disk again, and DRBD then proceeds with synchronization.
  3. If the mirror disk does not synchronize, delete the mirror disk resource and create it again. When you delete the resource, you must also delete the created mirror and create it again.

...

  1. A failover occurs because of a failure of the source server (A).
  2. The role of the target server (B) changes to Primary. (The mirror disk role changes.)
  3. Reboot the initial source server (A).
  4. After the initial source server (A) boots, check the role of the target server (B).
  5. Check the GI value on both nodes.
  6. Check whether the GI data matches. If it does, data synchronization proceeds automatically.
  7. If the check in steps 5 and 6 fails, the GI data does not match, so synchronization is required on one node, but no automatic synchronization takes place. (A split brain has occurred.)
     
    When this state is reached, an overlay appears on the mirror disk resource icon in the MCCS web console, and the 'SplitBrainStatus' attribute value is set to true.
    In this case, you need to change the mirror disk role manually and then resynchronize the mirror.
    To change the mirror disk role manually, use the MCCS web console.
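
The decision described in steps 5 to 7 can be sketched as follows. This is a deliberately simplified model: the GI value is treated as a single opaque string and the class and function names are hypothetical. The real comparison performed by DRBD involves several generation identifiers and more cases than shown here.

```python
from dataclasses import dataclass


@dataclass
class MirrorNode:
    name: str
    role: str                         # "Primary" or "Secondary"
    gi: str                           # opaque GI value (simplified assumption)
    split_brain_status: bool = False  # mirrors the 'SplitBrainStatus' attribute


def reconnect(a: MirrorNode, b: MirrorNode) -> str:
    """Decide what happens when the rebooted node rejoins its peer."""
    if a.gi == b.gi:
        # Matching GI data: synchronization can proceed automatically.
        return "auto-sync"
    # Mismatched GI data: flag both nodes and wait for manual resolution.
    a.split_brain_status = True
    b.split_brain_status = True
    return "manual-resolve"
```

In this model, two nodes with different GI values end up with split_brain_status set to True on both sides, which corresponds to the state that must be resolved from the MCCS web console.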

...

  1. Check the resource attribute view. 

    [Figure] Verify SplitBrain of MirrorDisk


  2. Check the mirror management view.


    [Figure] Checking Mirror Disk Split Brains

    Warning

    1) The ConnectState of both nodes is StandAlone, and the SplitBrainStatus values are set to True.
    2) Check LastMirrorOnlineTime on the mirror disk. (LastMirrorOnlineTime is based on the system time, so it is not an absolute criterion for determining which node has the latest data.)

    3) When a split brain occurs, the following log is displayed.
    (DRBD volume (r0) has a split brain.)
    4) In the mirror management window, the mirror condition is set to 'SPLIT'.

  3. Select the mirror disk, right-click it, and click 'Resolve Split Brains'.

    [Figure] Split Brain Resolving Selected


  4. A window explaining split brain is displayed.
     
    [Figure] Checking the Source Node Selection

  5. Select the source node.

    [Figure] Source Role Node Selection


  6. Recheck the selected source node.

    [Figure] Rechecking the Source Node Selection


  7. The split brain is being resolved.

    [Figure] Split Brain Resolved


  8. The split brain resolution is finished.

    [Figure] Resolving Split Brain Finished


  9. The selected node becomes the source node, and the DiskState of the mirror disk changes to UpToDate.

    [Figure] Split Brain Resolve


    Warning

    All changes made on node B will be overwritten.
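
The checks and the outcome of the procedure above can be modeled as in the sketch below. The dictionary keys ConnectState, SplitBrainStatus, and DiskState are the attribute names used in this chapter. The 'data' key and both function names are hypothetical and exist only to show that the node not chosen as the source loses its changes.

```python
def is_split_brain(node_a: dict, node_b: dict) -> bool:
    """True when both nodes are StandAlone and both report SplitBrainStatus."""
    return all(
        n["ConnectState"] == "StandAlone" and n["SplitBrainStatus"] is True
        for n in (node_a, node_b)
    )


def resolve_split_brain(source: dict, target: dict) -> None:
    """Model the result of choosing 'source' as the source node."""
    # The target's own changes are discarded and replaced by the source's data.
    target["data"] = list(source["data"])
    for n in (source, target):
        n["SplitBrainStatus"] = False
        n["DiskState"] = "UpToDate"
```

Because the target's data is replaced, it is worth confirming which node holds the data to keep before selecting the source node.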

...

When multiple drive letters exist on one disk and one letter is registered, the other letters cannot be accessed

The SCSI Lock disk supports only basic disks with a single drive letter per disk device. Do not use dynamic disks or disks with multiple volumes (one LUN configured with several partitions).

...

You must first define the disk device for the drive letter and request activation before the DUID information associated with that letter is recorded in main.json.

...

  1. In the MCCS web console, click 'File' on the menu bar to collect support files.

    [Figure] Collecting Support Files from Menu Bar  

  2. Support files can also be collected by clicking the toolbar button shown in the figure below.

    [Figure] Collecting Support Files from Toolbar

  3. You can select the node from which to collect support files, or retrieve a previously collected support file.

    [Figure] Support File Node Selection and Previous Support File Selection

  4. Click the 'OK' button to collect the support file.

    [Figure] Support Files Being Collected

     

    Info

    Collection may take several minutes, depending on the size of the log files and the network condition.

    You can download the file from the download window.

  5. The collected support files can be checked in the designated location.

    [Figure] Support Files

...