MCCS includes a bundled product called EMS (Emergency Message Service) that automatically sends SMS messages to the administrators designated to handle critical events.
In addition, because the console is web-based, errors and faults can be managed from anywhere with internet access. Past failure records, management, and reporting are also easy to use.
Please refer to "EMS User Guide and EMS Agent Installation Guide" for more details about the EMS system.
EMS Agent
It is a program installed on the server to connect to the EMS server.
EMS Server
It is a server program installed at the company that provides MCCS.
Save Log
EMS Agent saves logs.
The EMS server can select logs by type using the 'LogType' attribute, as shown below.
H
Saves logs related to HA (MCCS).
(Only a file monitor can be specified.)
A
Saves logs related to applications.
(Only a file monitor can be specified.)
S
Saves the Windows system event log.
(Only a Windows event monitor can be specified.)
P
Saves logs related to processes.
(Only the specified file can be monitored.)
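The LogType codes above can be summarized as a small lookup table. The following is a minimal Python sketch for illustration only: the names `LOG_TYPES` and `describe_log_type` are hypothetical and are not part of EMS.

```python
# Hypothetical lookup table summarizing the LogType codes listed above.
# LOG_TYPES and describe_log_type are illustrative names, not an EMS API.
LOG_TYPES = {
    "H": ("HA (MCCS) logs", "file monitor"),
    "A": ("application logs", "file monitor"),
    "S": ("Windows system event log", "Windows event monitor"),
    "P": ("process logs", "specified file only"),
}


def describe_log_type(code: str) -> str:
    """Return a one-line description for a LogType code (case-insensitive)."""
    key = code.upper()
    if key not in LOG_TYPES:
        raise ValueError(f"unknown LogType: {code!r}")
    what, monitor = LOG_TYPES[key]
    return f"{key}: saves {what} (monitor: {monitor})"
```

For example, `describe_log_type("s")` describes the Windows system event log entry, and an unlisted code raises `ValueError`.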
Log Analysis
EMS Server users can set the failure level for each system that is to receive EMS service.
The EMS server uses this failure level to filter the logs from the EMS Agent on the operating server and analyzes them to determine whether a failure has occurred.
SMS Notification
Once a failure has been confirmed against the given filter, EMS sends an SMS to the system operator and the MCCS server operator so that it can be dealt with quickly.
Analyzing the cause of failure after connecting to the EMS server
The system operator and the MCCS service operator can access the EMS server from anywhere with an internet connection to check the logs and analyze the cause.
In addition, for manufacturing customers, it provides centralized monitoring of all servers in the factory, along with statistics on failure types by period and troubleshooting solutions.
The following figure shows the workflow of the EMS system.
[Figure] Workflow of EMS System
Control Monitoring of EMS Server
Consolidated Web-based Dashboard of EMS Server
The following is a part of the consolidated web-based dashboard of EMS Server.
Servers with failures are shown in red, servers whose failure has already been notified to the server operators are shown in yellow, and servers operating normally are shown in blue.
Only users registered in the EMS server can monitor the dashboard.
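The color rule of the dashboard can be written as a short decision function. This is only an illustrative Python sketch of the rule described above; the function name and its two boolean inputs are assumptions, not an EMS interface.

```python
def dashboard_color(has_failure: bool, operators_notified: bool) -> str:
    """Map a server's state to the dashboard color described above.

    Hypothetical helper: the two flags are an assumed simplification
    of the server state tracked by the EMS server.
    """
    if has_failure and operators_notified:
        return "yellow"  # failure occurred and operators were notified
    if has_failure:
        return "red"     # failure not yet notified
    return "blue"        # operating normally
```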
[Figure] Redundant server monitoring view of EMS system
[Figure] Statistic view of EMS system
This is the case where the system reboots or shuts down because of conflicts between devices (NIC, RAID controller) or a kernel driver problem caused by another application.
Active application, process, and Windows service resources are operated by the 4 elements below.
A network failure occurs when there is a problem with the network connection, for example a broken network switch or network interface card, a disconnected network cable, or a ping timeout on part of the network.
※ Because the MCCS license is tied to the MAC address, the license must be reissued if the network interface card is changed.
If a failure occurs in the service network of the active server, a fault mark is shown on the network interface card resource or IP address resource of the node in the MCCS web console, and the service fails over to the standby server.
[Figure] Failure in Network Interface Card
At this point, when a failure occurs on the active server and a failover to the standby server is needed, MCCS uses a redundant heartbeat network that is still working to perform the failover.
If all redundant heartbeat networks are disconnected, MCCS uses the service network as the heartbeat line.
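The fallback order described above can be sketched as follows. This is a simplified Python illustration, not MCCS code: the function name, the dictionary of heartbeat networks, and the `"service"` return value are all hypothetical.

```python
from typing import Dict, Optional


def pick_heartbeat_line(heartbeats: Dict[str, bool],
                        service_network_up: bool) -> Optional[str]:
    """Choose the line used for heartbeat traffic.

    heartbeats maps a (hypothetical) heartbeat network name to True
    when that network is connected. The first connected heartbeat
    network wins; if none is connected, the service network is used.
    """
    for name, connected in heartbeats.items():
        if connected:
            return name
    # All redundant heartbeats are down: fall back to the service network.
    return "service" if service_network_up else None
```

With `{"hb1": False, "hb2": True}` the sketch selects `hb2`; with every heartbeat down it falls back to the service network.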
[Figure] Failure in Heartbeat
When a failure occurs in the replication network, data cannot be replicated, and the mirror disk resource is shown as 'Paused()' in the MCCS web console.
[Figure] Failure in Replicated Network
When the public network is configured with a single network switch and that switch fails, all resources on the active and standby servers are taken offline, and the resources where the failure occurred are shown as 'fault'.
[Figure] Failure in Network Switch
Source Disk Failure
If a failure occurs in the disk resource of the active server, the MCCS web console shows the failure. Because the disk can no longer be read or written, MCCS fails over to the standby server.
[Figure] Failure in Mirror Disk
Target Disk Failure
If a failure occurs in the disk resource of the standby server, the MCCS web console shows it as 'Paused'. This does not affect the service running on the active server.
[Figure] Failure in Target Disk
Although it happens rarely, the mirror disks can be identified as the source on both servers. This happens when the existing source cannot be changed to the target during a role change.
Both servers then try to synchronize their data, which causes the split brain. Split brain occurs in the situation shown below.
In this case, the icons of the mirror disk resource are shown overlapping each other in the MCCS web console, and the 'MirrorRole' attribute value on both servers is 'Source'.
When this happens, the role of the mirror disk must be changed manually; after the change, re-synchronization takes place.
The role of the mirror disk can be changed using the MCCS web console.
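The split brain condition described above comes down to a simple check on the 'MirrorRole' attribute of both nodes. The Python sketch below is illustrative only: representing each node's attributes as a dictionary and the function name `is_split_brain` are assumptions, and it does not query MCCS.

```python
from typing import Dict


def is_split_brain(node_a: Dict[str, str], node_b: Dict[str, str]) -> bool:
    """Return True when both nodes report the mirror disk role 'Source'.

    Each argument is a hypothetical dict of resource attributes,
    for example {"MirrorRole": "Source"}.
    """
    return (node_a.get("MirrorRole") == "Source"
            and node_b.get("MirrorRole") == "Source")
```

A normal pair such as `{"MirrorRole": "Source"}` and `{"MirrorRole": "Target"}` returns `False`; only two sources return `True`.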
How to resolve split brain issues using the MCCS web console
Check the resource attribute view.
[Figure] Verify SplitBrain of MirrorDisk
Check the mirror management view.
[Figure] Checking Mirror Disk Split Brains
1) On both nodes, MirrorRole is Source and MirrorState is MIRROR_PAUSED.
The selected node becomes the source node and the mirror disk state changes to MIRRORING.
[Figure] Split Brain Resolved
All changes made on node B will be overwritten.
When the external disk fails or its connection path is faulty, the disk cannot be read or written. MCCS therefore displays the failure and proceeds with a failover.
[Figure] Failure in Shared Disk
If there is a problem in the external storage, service is stopped until the storage recovers. The storage should therefore be recovered quickly or replaced with another unit (backup storage).
Problems related to the external storage should be handled together with the storage vendor.
When the connection between the server and the external storage, and the disk where the failure occurred, are back to normal, the server should be rebooted so that the MCCS kernel driver can recognize the recovered environment.
Redundancy measures should also be arranged with the storage vendor.
SMB, which is supported on Windows 2000 or later, uses Direct-Hosted mode. This feature provides the file sharing service directly, without the NetBIOS interface.
To resolve a name to an IP address, a DNS lookup is performed and NetBIOS name resolution is not used.
Transmit and receive operation
That is, if you work with a DNS server while using the NetBIOS agent, most clients connect through Direct-Hosted SMB.
When verifying the agent's behavior, flush the related caches.
Flush the NetBIOS name cache
nbtstat -R
Flush the DNS cache
ipconfig /flushdns
Flush the ARP cache
arp -d
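When the flush has to be repeated during testing, the three commands can be run in sequence from a small script. The following Python sketch is an illustration under assumptions: the names `FLUSH_COMMANDS` and `flush_name_caches` are hypothetical, the commands are Windows-only, and they typically need an elevated (administrator) prompt. By default it only lists the commands without running them.

```python
import platform
import subprocess

# Cache flush commands, in the order given above.
# nbtstat -R purges and reloads the NetBIOS remote name cache.
FLUSH_COMMANDS = [
    ["nbtstat", "-R"],
    ["ipconfig", "/flushdns"],
    ["arp", "-d"],
]


def flush_name_caches(dry_run: bool = True) -> list:
    """Return the flush commands as strings; run them when dry_run is False."""
    lines = [" ".join(cmd) for cmd in FLUSH_COMMANDS]
    if not dry_run:
        if platform.system() != "Windows":
            raise RuntimeError("these commands are Windows-specific")
        for cmd in FLUSH_COMMANDS:
            # check=True raises CalledProcessError if a command fails.
            subprocess.run(cmd, check=True)
    return lines
```

Calling `flush_name_caches()` is safe on any system because nothing is executed; pass `dry_run=False` on the Windows client to actually flush the caches.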
To turn off NetBIOS-related communication, block the following destination port numbers.
TCP/UDP 137, 138, 139, 445
To turn off DNS server update and WINS server update communication, block the following destination port numbers.
TCP/UDP 42, 53
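After blocking these ports, it can be useful to confirm from a client that they are no longer reachable. The Python sketch below is a hedged illustration: the helper names are hypothetical, and it probes TCP only, so it says nothing about the UDP side of the same ports.

```python
import socket

# Destination ports listed above (TCP side only is probed here).
NETBIOS_SMB_PORTS = (137, 138, 139, 445)
NAME_UPDATE_PORTS = (42, 53)


def tcp_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name lookup failed.
        return False


def open_tcp_ports(host: str,
                   ports=NETBIOS_SMB_PORTS + NAME_UPDATE_PORTS) -> list:
    """Return the subset of ports that still accept TCP connections."""
    return [p for p in ports if tcp_port_open(host, p)]
```

An empty list from `open_tcp_ports(host)` suggests that the TCP ports are blocked or that no service is listening on them.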
The Workstation service in Windows Services creates and maintains client connections to remote servers using the SMB protocol.
If this service is stopped, these connections cannot be maintained. If this service is disabled, any service that explicitly depends on it cannot be started.
Take care when stopping the Workstation service, as the following services are affected.
Service Name |
---|
Alerter Service |
Browser Service |
Messenger Service |
Net Logon Service |
RPC Locator Service |
The Windows Server service supports file, print, and named-pipe sharing over the network for this computer. If this service is stopped, these features cannot be used. If it is disabled, the services that depend on it cannot start.
On a server where a cluster is configured, the service status must be "Started" if you want to use the NetBIOS agent or mirror agent.
Take care when stopping the Server service, as the following services are affected.
Service Name |
---|
Browser Service |
First, verify that file sharing works with the original NetBIOS computer name rather than the virtual name.
On the client, verify that the node's files can be accessed normally using commands such as dir, start, explorer, or net view.
Verify with the DIR command.
The DIR command is run with the following syntax.
dir \\virtual_name\shared_folder
Verify with the START command.
The START command is run with the following syntax.
start \\virtual_name
Verify with the EXPLORER command.
The EXPLORER command is run with the following syntax.
explorer \\virtual_name
Verify with the NET VIEW command.
The NET VIEW command is run with the following syntax.
net view virtual_name
The list of file and print shares on the specified computer is displayed. If the specified computer has no file or print shares available, the message "There are no entries in the list" is shown.
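The four checks above differ only in the name being tested, so the command lines can be generated for any virtual name. This is a minimal Python sketch for convenience; the function name `verification_commands` is hypothetical, and the output simply follows the syntax shown above.

```python
def verification_commands(virtual_name: str, shared_folder: str) -> list:
    """Build the dir, start, explorer, and net view command lines."""
    unc = "\\\\" + virtual_name  # UNC prefix, e.g. \\virtual_name
    return [
        f"dir {unc}\\{shared_folder}",
        f"start {unc}",
        f"explorer {unc}",
        f"net view {virtual_name}",
    ]


if __name__ == "__main__":
    # Example values are placeholders, not real host or share names.
    for line in verification_commands("virtual_name", "shared_folder"):
        print(line)
```

The generated lines can be pasted into a command prompt on the client, first with the real computer name and then with the virtual name.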
If the client does not refresh the mapping between the virtual name and the real IP address after a failover, it cannot communicate for a few minutes until its NetBIOS cache is flushed.
This can happen when you use a WINS server. In this case, the client program needs to be cluster-aware.
When interlocking with a volume manager that uses SCSI-3 PR
A volume manager that has a SCSI-3 PR reservation function (for example, Symantec SFW) can be used with the SCSI Lock agent.
When checking whether SCSI-3 PR is supported
To check whether the disk supports the SCSI-3 PR function, check the PR type using the scsicmd.cmd command.
When the path to sg_scan.exe or sg_persist.exe cannot be found
Check whether the commands exist in %MCCS_HOME%/bin.
When interlocking with a shared disk
When using the shared disk agent together with the SCSI Lock agent, first check that the shared disk agent works normally, and then register the SCSI Lock agent.
The disk used by the SCSI Lock agent serves as a lock device at the hardware level; its contents are not used. The disk can therefore be small, and it is not protected.
When a registration key error occurs
Remove the reservation key and the registration key using the scsicmd.cmd -c or -cf command, and then set them again. Before registering a resource, check whether a key is already registered; if so, remove it before registering.
Note that the current key is set automatically from the MAC address of the first network adapter. This key is automatically recorded in the settings file. If the key does not exist in the settings file, a new key is not created.
When one disk has multiple drive letters and registering one letter makes the other letters inaccessible
The SCSI Lock disk supports only a basic disk with a single drive letter. Do not use a dynamic disk or a disk with multiple volumes (one LUN configured with several partitions).
When the DUID remains unresolved after registering the agent
You must first define the drive letter and request activation; only then is the DUID information associated with the letter recorded in main.json.
When deleting the agent
The reservation is canceled when a SCSI Lock agent is deleted. Keep in mind that the shared disk that was reserved can then be used by the other node. In other words, make sure the other node is down before you delete the agent.
When problems occur in MCCS, a support file must be collected to gather logs and configuration information.
There are 2 ways to collect the support file.
Click the 'OK' button to collect the support file.
[Figure] Support Files Being Collected
It may take several minutes depending on the size of the log files and the network condition.
The script file is located at:
%MCCS_HOME%\bin\Support\support.cmd
This method collects information only from the node on which it is run.
The collected support file is created at the following location.
%MCCS_HOME%\support-%COMPUTERNAME%\%COMPUTERNAME%.zip
If a support file already exists, the new file overwrites it, so take care.
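Because an existing support file is overwritten, it can help to keep a timestamped copy before collecting again. The Python sketch below is an assumption-laden illustration, not an MCCS tool: the helper names are hypothetical, and the path layout assumes the file is named after the computer name with a .zip extension inside the support-<computer name> directory.

```python
import ntpath
import os
import shutil
import time
from typing import Optional


def support_zip_path(mccs_home: str, computer_name: str) -> str:
    """Build the Windows-style path where the support file is expected."""
    return ntpath.join(mccs_home, f"support-{computer_name}",
                       f"{computer_name}.zip")


def backup_existing(path: str) -> Optional[str]:
    """Copy an existing support file to a timestamped name.

    Returns the backup path, or None when there is nothing to back up.
    """
    if not os.path.isfile(path):
        return None
    root, ext = os.path.splitext(path)
    backup = f"{root}-{time.strftime('%Y%m%d-%H%M%S')}{ext}"
    shutil.copy2(path, backup)  # keeps timestamps of the original file
    return backup
```

On the node, one might call `backup_existing(support_zip_path(os.environ["MCCS_HOME"], os.environ["COMPUTERNAME"]))` before running the collection again.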