Appendix A - Administration after Failure
Even after a redundant environment has been configured with MCCS, failures may occur.
This chapter explains how MCCS detects a failure and how to administer the system after a failure or failover.
(In the following examples, the operating server is registered in MCCS as 'Active' and the standby server as 'Standby'.)
How to use EMS (Emergency Message Service)
MCCS has a bundled product called EMS (Emergency Message Service) that automatically sends an SMS to the designated administrators when a critical event occurs.
In addition, because the console is web-based, an error or fault can be managed from anywhere with internet access. Past failure records, management, and reporting are also easy to use.
EMS Components
EMS Agent
A program installed on the server that connects to the EMS Server.
EMS Server
A server program installed at the vendor that supplies MCCS.
EMS Workflow
Save Log
The EMS Agent saves logs.
The EMS Server can select logs by type through the 'LogType' attribute, which takes the values shown below.
H
Saves logs related to HA (MCCS).
(Only the file monitor can be specified.)
A
Saves logs related to applications.
(Only the file monitor can be specified.)
S
Saves the event log of the Windows system.
(Only the Windows event monitor can be specified.)
P
Saves logs related to processes.
(It can only monitor the specified file.)
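The LogType values above can be pictured as a small lookup table. The following is only a minimal sketch: the names LogTypeInfo, LOG_TYPES and monitor_for, and the monitor labels "file" and "windows_event", are hypothetical and are not identifiers taken from EMS. Treating type P as using the file monitor is one reading of the note above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogTypeInfo:
    description: str
    monitor: str  # hypothetical label for the kind of monitor allowed


# Keys are the LogType values listed above. The monitor labels are
# illustrative names only, not identifiers taken from EMS.
LOG_TYPES = {
    "H": LogTypeInfo("HA (MCCS) logs", "file"),
    "A": LogTypeInfo("application logs", "file"),
    "S": LogTypeInfo("Windows system event log", "windows_event"),
    "P": LogTypeInfo("process logs", "file"),  # assumed reading of the note for P
}


def monitor_for(log_type: str) -> str:
    """Return the monitor kind that may be specified for a LogType value."""
    try:
        return LOG_TYPES[log_type].monitor
    except KeyError:
        raise ValueError(f"unknown LogType: {log_type!r}") from None
```

For example, monitor_for("S") returns "windows_event", and any value outside H, A, S and P raises ValueError.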
Log Analysis
EMS Server users can set a failure level for each system that is to receive the EMS service.
The EMS Server uses this failure level to filter the logs from the EMS Agent on the operating server and analyzes them to determine whether a failure has occurred.
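The filtering step might look like the sketch below. It is an illustration under assumptions, not the EMS implementation: the severity scale in LEVELS, the LogEntry fields, and the rule "a log counts as a failure when its level is at or above the configured failure level" are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical severity scale; the actual EMS failure levels are not listed here.
LEVELS = {"info": 0, "warning": 1, "critical": 2}


@dataclass
class LogEntry:
    system: str
    level: str
    message: str


def find_failures(entries, failure_level):
    """Return the entries whose level is at or above the configured failure level."""
    threshold = LEVELS[failure_level]
    return [e for e in entries if LEVELS[e.level] >= threshold]


entries = [
    LogEntry("Active", "info", "heartbeat ok"),
    LogEntry("Active", "critical", "application stopped"),
    LogEntry("Standby", "warning", "replication delayed"),
]
failures = find_failures(entries, "warning")
```

With the failure level set to "warning", the second and third entries are treated as failures and the first is ignored.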
SMS Notification
When a log matches the configured filter, EMS sends an SMS to the system operator and the MCCS server operator so that the failure can be dealt with quickly.
Connecting to the EMS Server and Analyzing the Cause of Failure
The system operator and the MCCS service operator can access the EMS Server from anywhere with an internet connection to check the logs and analyze the cause.
In addition, for manufacturing customers, it provides centralized monitoring of all servers in the factory, along with periodic statistics by failure type and troubleshooting solutions.
The following figure shows the workflow of the EMS system.
[Figure] Workflow of EMS System
Control Monitoring of EMS Server
Consolidated Web-based Dashboard of EMS Server
The following is part of the consolidated web-based dashboard of the EMS Server.
Servers with failures are shown in red, servers whose failure has already been reported to the server operators are shown in yellow, and servers operating normally are shown in blue.
Only users registered on the EMS Server can view the dashboard.
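The color rule of the dashboard can be summarized in a few lines. This is only a sketch of the rule as described above; the function name dashboard_color and its parameters are hypothetical.

```python
def dashboard_color(has_failure: bool, operator_notified: bool) -> str:
    """Map a server's status to the dashboard color described above."""
    if not has_failure:
        return "blue"      # operating normally
    if operator_notified:
        return "yellow"    # failure already reported to the server operators
    return "red"           # failure not yet reported
```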
[Figure] Redundant server monitoring view of EMS system
[Figure] Statistic view of EMS system
Server Failure
This is the case where the system reboots or shuts down because of a device conflict (NIC, RAID controller) or a kernel driver problem caused by another application.
Active Server Failure
Standby Server Failure
MCCS displays the failure when it occurs on the standby server.
Data replication is paused until the standby server returns to normal.
[Figure] Failure in Standby Server
Data synchronization cannot be achieved. The mirror disk enters the 'Network Connection Failure' state.
A failure of the standby server does not affect operation. However, because there is then no server to fail over to, the server operator must check the problem in the MCCS web console and make sure that the standby server is restored promptly.
When the standby server returns to normal, the failure icon disappears.
Data synchronization is then performed so that the mirror disk's DiskState value changes from 'Inconsistent' to 'UpToDate'.
When the synchronization finishes, data is again replicated in real time, and the mirror disk icon changes accordingly.
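The state changes described above can be followed with a toy model. This is a sketch under assumptions and does not reflect how MCCS replicates data internally: the class MirrorDisk, the pending-write list, and the idea that the state becomes 'Inconsistent' at the moment the standby server fails are all hypothetical. Only the DiskState values 'Inconsistent' and 'UpToDate' come from the text.

```python
class MirrorDisk:
    """Toy model of the mirror disk states described above."""

    def __init__(self):
        self.disk_state = "UpToDate"
        self.standby_up = True
        self.pending = []  # assumption: writes not yet copied to the standby

    def write(self, block):
        # While the standby is down, replication is paused, so the
        # write is only remembered for later synchronization.
        if not self.standby_up:
            self.pending.append(block)

    def standby_failed(self):
        self.standby_up = False
        self.disk_state = "Inconsistent"

    def standby_recovered(self):
        """Synchronize the pending data, then resume real-time replication."""
        copied = len(self.pending)
        self.pending.clear()
        self.standby_up = True
        self.disk_state = "UpToDate"
        return copied
```

In this model, writes made while the standby server is down are counted and cleared during synchronization, after which the state is 'UpToDate' again.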
Application Failure
Active application resources are governed by the four attributes below.
MonitorInterval (Default Value=10sec)
The resource is monitored at the interval set by this value.
MonitorTimeout (Default Value=10sec)
If there is no reply within the set time, the resource is considered to have failed.
RestartLimit (Default Value=0)
The application resource is restarted up to the set number of times.
OnlineTrustTime (Default Value=600sec)
The time after which the restart count of the resource is reset.
The attributes above are set when the resource is registered, and users can check or change their values in the Resource Attribute view of the MCCS console.
[Figure] Resource attribute value Edit
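How the four attributes described above might interact can be sketched as follows. This is an illustration under assumptions, not the MCCS implementation: the names ResourcePolicy and RestartTracker are hypothetical, the restart count is assumed to be cleared once the resource has stayed online for the trust time, and the result "fault" stands for whatever MCCS does once no restarts remain. The defaults are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class ResourcePolicy:
    monitor_interval: int = 10    # seconds between monitoring checks
    monitor_timeout: int = 10     # seconds to wait for a reply
    restart_limit: int = 0        # number of restarts allowed
    online_trust_time: int = 600  # seconds online before the count is reset


def timed_out(reply_seconds, policy):
    """A check fails when there is no reply within monitor_timeout."""
    return reply_seconds is None or reply_seconds > policy.monitor_timeout


class RestartTracker:
    def __init__(self, policy):
        self.policy = policy
        self.restarts = 0
        self.online_since = 0.0

    def on_failure(self, now):
        """Return 'restart' or 'fault' for a failure detected at time now."""
        # Assumption: staying online for online_trust_time clears the count.
        if now - self.online_since >= self.policy.online_trust_time:
            self.restarts = 0
        if self.restarts < self.policy.restart_limit:
            self.restarts += 1
            self.online_since = now  # the restart is treated as immediate
            return "restart"
        return "fault"
```

With the default restart limit of 0, the first detected failure already yields "fault". With a limit of 1, a second failure shortly after a restart yields "fault", while a failure that occurs after the resource has been online for the trust time yields "restart" again.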