Appendix A. Administrating after failure

 

 

How to use EMS (Emergency Message Service)

MCCS is bundled with EMS (Emergency Message Service), a product that automatically sends SMS messages to the administrators designated to handle critical events.
In addition, because the console is web-based, errors and faults can be managed from anywhere with an internet connection. Past failure records, management, and reporting are also easy to use.
Please refer to the "EMS User Guide and EMS Agent Installation Guide" for more details about the EMS system.

EMS Components

EMS Agent

It is a program installed on the server that connects to the EMS Server.

EMS Server

It is a server program installed at the company that provides MCCS.

 

EMS Workflow

Save Log

The EMS Agent saves logs.
The EMS Server can specify which logs to save by type, using the 'LogType' attribute as shown below.

H

It saves logs related to HA (MCCS).
(Only the file monitor can be specified.)

A

It saves logs related to applications.
(Only the file monitor can be specified.)

S

It saves the event log of the Windows system.
(Only the Windows event monitor can be specified.)

P

It saves logs related to processes.
(Only a specified file can be monitored.)
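The LogType codes above can be summarized as a small lookup table. The following is only an illustrative sketch in Python, not part of EMS: the names `MonitorKind`, `LOG_TYPES`, and `monitor_for` are hypothetical, and treating 'P' as a file-based monitor is one reading of the note above.

```python
from enum import Enum


class MonitorKind(Enum):
    """Monitor kinds named in this appendix (hypothetical enum, not an EMS API)."""
    FILE = "file monitor"
    WINDOWS_EVENT = "Windows event monitor"


# LogType code -> (what the log covers, monitor kind it can be paired with).
# 'P' is assumed to be file-based because it monitors a specified file.
LOG_TYPES = {
    "H": ("logs related to HA (MCCS)", MonitorKind.FILE),
    "A": ("logs related to applications", MonitorKind.FILE),
    "S": ("event log of the Windows system", MonitorKind.WINDOWS_EVENT),
    "P": ("logs related to processes", MonitorKind.FILE),
}


def monitor_for(log_type: str) -> MonitorKind:
    """Return the monitor kind for a LogType code; reject unknown codes."""
    try:
        return LOG_TYPES[log_type.upper()][1]
    except KeyError:
        raise ValueError(f"unknown LogType: {log_type!r}") from None
```

For example, `monitor_for("S")` yields `MonitorKind.WINDOWS_EVENT`, while an unlisted code such as `"X"` raises `ValueError`.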

 

Log Analysis

EMS Server users can set the failure level for each system that is to receive the EMS service.
The EMS Server uses the configured failure level to filter the logs from the EMS Agent on each operating server, and analyzes those logs to determine whether a failure has occurred.
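The filtering idea can be pictured as a threshold check on each log entry. This is a minimal sketch under stated assumptions: the severity names in `LEVELS`, the `LogEntry` fields, and the functions `is_failure` and `failures` are all hypothetical, since this appendix does not list the actual EMS failure levels or log format.

```python
from dataclasses import dataclass

# Hypothetical severity scale; the real EMS failure levels are not given here.
LEVELS = {"INFO": 0, "WARNING": 1, "ERROR": 2, "CRITICAL": 3}


@dataclass(frozen=True)
class LogEntry:
    server: str    # server the EMS Agent runs on
    log_type: str  # one of H, A, S, P
    level: str     # key of LEVELS
    message: str


def is_failure(entry: LogEntry, threshold: str) -> bool:
    """Treat an entry as a failure when it meets or exceeds the configured level."""
    return LEVELS[entry.level] >= LEVELS[threshold]


def failures(entries, threshold):
    """Keep only the entries that count as failures at the given level."""
    return [e for e in entries if is_failure(e, threshold)]
```

With a threshold of `"WARNING"`, an `ERROR` entry would be kept and an `INFO` entry dropped.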

 

SMS Notification

Once the configured filter confirms a failure, EMS sends an SMS to the system operator and the MCCS server operator so that the failure can be dealt with quickly.

 

Connecting to the EMS Server and analyzing the cause of failure

The system operator and the MCCS service operator can access the EMS Server from anywhere with an internet connection to check the logs and analyze the cause.
In addition, for manufacturing customers, EMS provides a centralized monitoring system for all servers in the factory, along with periodic statistics on failure types and troubleshooting solutions.
The following figure shows the workflow of the EMS system.

[Figure] Workflow of EMS System

 

Control Monitoring of EMS Server

Consolidated Web-based Dashboard of EMS Server

The following shows part of the consolidated web-based dashboard of the EMS Server.
Servers with failures are shown in red, servers whose failure has been reported to the server operators are shown in yellow, and servers operating normally are shown in blue.
Only users registered on the EMS Server can monitor the dashboard.

[Figure] Redundant server monitoring view of EMS system
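The color coding described for the dashboard reduces to a two-flag decision. The sketch below is illustrative only; `dashboard_color` and its parameters are hypothetical names, not part of the EMS Server.

```python
def dashboard_color(has_failure: bool, operators_notified: bool = False) -> str:
    """Map a server's state to the dashboard color described in the text."""
    if not has_failure:
        return "blue"  # operating normally
    # A failed server turns yellow once the operators have been notified.
    return "yellow" if operators_notified else "red"
```

A failed server would therefore move from red to yellow after notification, and back to blue once it operates normally again.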

 

[Figure] Statistic view of EMS system

Server Failure

This is the case where the system reboots or shuts down because of device conflicts (NIC, RAID controller) or kernel driver problems caused by other applications.

Active Server Failure

  1.  MCCS behaves the same whether the server terminates normally or abnormally: when the active server fails, MCCS performs a failover to the standby server.
    In the node management menu on the right side of the screen, select the server. You can check the details of the failure in the 'Resource Status' and 'Resource Dependency' screens.
    Since data cannot be replicated while the server is down, a status icon is shown on the mirror disk resource.

    • Normal termination of a system
      This is the case where the user selected 'system shutdown' in the operating system.

    • Abnormal termination of a system
      This is the case where the system is terminated or rebooted due to an unexpected situation or a blue screen.


    [Figure] Failure in Active Server

  2. The server operator checks the failure and restores the server to normal.

  3. Once the failed server has rebooted, check the mirror roles of the two servers, then set the failed server as the replication target and perform a partial resync.
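The three steps above can be pictured with a toy model of a two-node mirror. This is a sketch for illustration only and does not call any MCCS interface; the `MirrorPair` class, its methods, and the node names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MirrorPair:
    """Toy model of an active/standby pair with a mirrored disk."""
    active: str
    standby: str
    replicating: bool = True
    events: list = field(default_factory=list)

    def fail_active(self) -> None:
        """Step 1: the active server fails and the standby takes over."""
        failed = self.active
        self.active, self.standby = self.standby, failed
        self.replicating = False  # the peer is down, so data cannot be replicated
        self.events.append(f"failover: {failed} -> {self.active}")

    def rejoin(self, server: str) -> None:
        """Step 3: the repaired server rejoins as the replication target."""
        if server != self.standby:
            raise ValueError(f"{server} is not the failed server")
        self.events.append(f"partial resync: {self.active} -> {server}")
        self.replicating = True


pair = MirrorPair(active="node1", standby="node2")
pair.fail_active()    # node2 now serves; replication is stopped
pair.rejoin("node1")  # node1 returns as the target of a partial resync
```

Step 2 (the operator repairing the server) happens outside the model, between the two calls.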

Standby Server Failure

  1. MCCS reports the failure when the standby server fails.

  2. Data replication is paused until the standby server is back to normal.


    [Figure] Failure in Standby Server

  3. If I/O continues, the data cannot be replicated and the mirror disk is shown as 'Paused'.

  4. If there is no I/O, the mirror disk icon does not change, but failure messages related to the mirror disk are written to the MCCS log.

  5. A standby server failure does not affect operation. However, because there is no server to fail over to, the server operator must check the problem in the MCCS web console and make sure the standby server is restored promptly.

  6. When the standby server is back to normal, the mirror disk recovers from 'Paused' to 'Normal' and the 'Paused' icon disappears.
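The mirror disk behavior in steps 3, 4, and 6 can be condensed into a small decision function. This is a hedged sketch, not MCCS code: `MirrorView` and `mirror_view` are hypothetical names, and it assumes the unchanged icon in step 4 corresponds to the 'Normal' display and that a failure message is logged whenever the standby server is down.

```python
from typing import NamedTuple


class MirrorView(NamedTuple):
    status: str           # what the console shows for the mirror disk
    failure_logged: bool  # whether the MCCS log holds a mirror-related failure


def mirror_view(standby_up: bool, io_during_outage: bool) -> MirrorView:
    """Return the displayed mirror state for a given standby condition."""
    if standby_up:
        return MirrorView("Normal", False)  # step 6: recovered
    if io_during_outage:
        return MirrorView("Paused", True)   # step 3: writes cannot be replicated
    return MirrorView("Normal", True)       # step 4: icon unchanged, log only
```

The case worth noticing is the last one: with no I/O the console looks healthy, so the log is the only place the outage shows up.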