Appendix A - Administration after failure

After a redundant environment has been configured with MCCS, failures may still occur.
This chapter explains how MCCS detects a failure and how to administer the system after a failure or failover.
(In the following examples, the operating server is registered in MCCS as 'Active' and the standby server as 'Standby'.)

How to use EMS (Emergency Message Service)

MCCS is bundled with a product called EMS (Emergency Message Service), which automatically sends SMS messages about critical events to the designated administrators.
In addition, because the console is web-based, errors and faults can be managed from anywhere with internet access. Past failure records, management, and reporting are also easy to use.

EMS Component

EMS Agent

It is a program installed on the server to connect to the EMS server.

EMS Server

It is a server program installed at the company that provides MCCS.

EMS Workflow

Save Log

The EMS Agent saves logs.
The EMS server can select logs by type using the 'LogType' attribute, as shown below.

LogType   Description

H         Saves logs related to HA (MCCS).
          (Only the file monitor can be specified.)

A         Saves logs related to the application.
          (Only the file monitor can be specified.)

S         Saves the event log of the Windows system.
          (Only the Windows event monitor can be specified.)

P         Saves logs related to processes.
          (Only the specified file can be monitored.)
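
The LogType codes above can be read as a small lookup table. The following Python sketch is only an illustration of that mapping; the names `LogTypeInfo`, `LOG_TYPES`, and `describe_log_type` are hypothetical and are not part of MCCS or EMS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogTypeInfo:
    description: str  # what kind of log is saved
    monitor: str      # what can be specified for this type


# Hypothetical lookup table built from the LogType descriptions above.
LOG_TYPES = {
    "H": LogTypeInfo("HA (MCCS) logs", "file monitor"),
    "A": LogTypeInfo("application logs", "file monitor"),
    "S": LogTypeInfo("Windows system event log", "Windows event monitor"),
    "P": LogTypeInfo("process logs", "specified file"),
}


def describe_log_type(code: str) -> str:
    """Return a one-line description for a LogType code (case-insensitive)."""
    key = code.upper()
    info = LOG_TYPES.get(key)
    if info is None:
        raise ValueError(f"unknown LogType: {code!r}")
    return f"{key}: {info.description} (via {info.monitor})"


if __name__ == "__main__":
    for code in LOG_TYPES:
        print(describe_log_type(code))
```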

 

Log Analysis

EMS Server users can set the failure level for each system that is to receive the EMS service.
The EMS server uses this failure level to filter the logs sent by the EMS Agent on the operating server, and analyzes them to determine whether a failure has occurred.
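
As a rough illustration of filtering by failure level, the Python sketch below keeps only the log entries at or above a configured level. The numeric severity scale, the `LogEntry` fields, and the `select_failures` function are assumptions made for this example; this guide does not define how EMS represents failure levels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LogEntry:
    server: str
    level: int    # hypothetical severity scale: higher means more severe
    message: str


def select_failures(entries: List[LogEntry], failure_level: int) -> List[LogEntry]:
    """Keep only entries whose severity reaches the configured failure level."""
    return [entry for entry in entries if entry.level >= failure_level]


if __name__ == "__main__":
    logs = [
        LogEntry("Active", 1, "resource monitor heartbeat"),
        LogEntry("Active", 3, "application resource did not reply"),
        LogEntry("Standby", 2, "mirror disk connection lost"),
    ]
    for entry in select_failures(logs, failure_level=2):
        print(entry.server, entry.message)
```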

 

SMS Notification

Once a failure is confirmed under the given filter, EMS sends an SMS to the system operator and the MCCS server operator so that the failure can be dealt with quickly.

 

Analyzing the cause of failure after connecting to the EMS server

The system operator and the MCCS service operator can access the EMS server from anywhere with an internet connection to check the logs and analyze the cause.
In addition, for manufacturing customers, it provides a centralized monitoring system for all the servers in the factory, as well as periodic statistics on failure types and troubleshooting solutions.
The following figure shows the workflow of the EMS system.

[Figure] Workflow of EMS System

 

Control Monitoring of EMS Server

Consolidated Web-based Dashboard of EMS Server

The following is part of the consolidated web-based dashboard of the EMS Server.
Servers with failures are shown in red, servers whose failure has already been reported to the server operators are shown in yellow, and servers operating normally are shown in blue.
Only users registered on the EMS server can monitor the dashboard.
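
The color rule of the dashboard can be summarized in a few lines. The Python sketch below only restates the rule described above; the function name `dashboard_color` and its two boolean inputs are hypothetical and do not correspond to any EMS interface.

```python
def dashboard_color(has_failure: bool, operator_notified: bool) -> str:
    """Map a server's state to the dashboard color described in this section."""
    if not has_failure:
        return "blue"      # operating normally
    if operator_notified:
        return "yellow"    # failure already reported to the server operators
    return "red"           # failure not yet reported


if __name__ == "__main__":
    print(dashboard_color(has_failure=True, operator_notified=False))
```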

[Figure] Redundant server monitoring view of EMS system

 

[Figure] Statistic view of EMS system 

 

Server Failure

This is the case where the system reboots or shuts down because of a device conflict (NIC, RAID controller) or a kernel driver problem caused by another application.

Active Server Failure

Standby Server Failure

  1. MCCS shows the failure when one occurs on the standby server.

  2. Data replication is paused until the standby server is back to normal.


    [Figure] Failure in Standby Server

  3. Data synchronization cannot take place. The mirror disk enters the 'Network Connection Failure' state ([icon]).

  4. A standby server failure does not affect operation. However, because there is no server to fail over to, the server operator must check the problem in the MCCS web console and make sure that the standby server is restored promptly.

  5. When the standby server is restored, the [icon] icon disappears.

  6. Data synchronization ([icon]) is performed so that the mirror disk's DiskState value changes from 'Inconsistent' to 'UpToDate'.

  7. When the synchronization is finished, current data is again replicated in real time. The icon is changed to [icon].
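
The sequence above can be pictured as a small state model. The Python sketch below is a simplified illustration, not MCCS code: the class `MirrorDisk`, its method names, and the replication labels are hypothetical, and it assumes that the DiskState is treated as 'Inconsistent' from the moment the standby server fails until synchronization finishes.

```python
class MirrorDisk:
    """Toy model of the mirror disk during a standby server failure."""

    def __init__(self) -> None:
        self.disk_state = "UpToDate"
        self.replication = "replicating"    # hypothetical label

    def standby_failed(self) -> None:
        # Steps 2-3: replication pauses and data can no longer be synchronized.
        self.replication = "paused"
        self.disk_state = "Inconsistent"    # assumption, see lead-in

    def standby_recovered(self) -> None:
        # Step 6: synchronization starts once the standby server is back.
        self.replication = "synchronizing"

    def sync_finished(self) -> None:
        # Step 7: data is replicated in real time again.
        self.disk_state = "UpToDate"
        self.replication = "replicating"


if __name__ == "__main__":
    disk = MirrorDisk()
    for event in (disk.standby_failed, disk.standby_recovered, disk.sync_finished):
        event()
        print(event.__name__, disk.replication, disk.disk_state)
```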

Application Failure

Active application resources are governed by the four attributes below.

  • MonitorInterval (Default Value=10sec)
    The resource is monitored at the interval set by this value.

  • MonitorTimeout (Default Value=10sec)
    If there is no reply within the set time, it is considered a failure.

  • RestartLimit (Default Value=0)
    The application resource is restarted up to the number of times set by this value.

  • OnlineTrustTime (Default Value=600sec)
    The time after which the resource's restart count is reset.

  The attributes above are set when the resource is added, and users can check or change the values through the Resource Attribute view of the MCCS console.


    [Figure] Resource attribute value Edit
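
To show how the four attributes listed above might interact, the following Python sketch simulates a series of monitor cycles. It is a minimal model under stated assumptions, not the MCCS implementation: the names `ResourcePolicy` and `simulate` are hypothetical, the "fault" outcome once the restart limit is used up is an assumption (this section does not say what MCCS does at that point), and the trust time is counted in whole monitor intervals for simplicity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResourcePolicy:
    # Defaults mirror the default values listed in this section (seconds / count).
    monitor_interval: int = 10
    monitor_timeout: int = 10
    restart_limit: int = 0
    online_trust_time: int = 600


def simulate(policy: ResourcePolicy, reply_times: List[Optional[float]]) -> List[str]:
    """Return one action per monitor cycle: "ok", "restart", or "fault".

    reply_times holds, per cycle, how long the resource took to reply in
    seconds, or None if it never replied.
    """
    actions: List[str] = []
    restarts = 0      # restarts used since the last reset
    online_for = 0    # seconds the resource has been continuously healthy
    for reply in reply_times:
        healthy = reply is not None and reply <= policy.monitor_timeout
        if healthy:
            online_for += policy.monitor_interval
            if online_for >= policy.online_trust_time:
                restarts = 0    # resource is trusted again, reset the count
            actions.append("ok")
            continue
        online_for = 0
        if restarts < policy.restart_limit:
            restarts += 1
            actions.append("restart")
        else:
            actions.append("fault")    # assumed outcome, see lead-in
            break
    return actions


if __name__ == "__main__":
    policy = ResourcePolicy(restart_limit=1)
    print(simulate(policy, [1.0, None, 2.0, None]))
```

With the default RestartLimit of 0, the first missed reply is reported as a fault without any restart attempt in this model.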