...

Failures can still occur after a redundancy environment has been configured with MCCS.
This chapter explains how MCCS detects a failure and how to administer the system after a failure or failover.
(In the following examples, the operating server is registered in MCCS as 'Active' and the standby server as 'Standby'.)

Server Failure

This is the case where the system reboots or shuts down because of a conflict between devices (NIC, RAID controller), a kernel driver problem, or a problem in another application.

Active Server Failure

The MCCS role is the same whether the server terminates normally or abnormally. MCCS performs a failover to the standby server when the operating server fails.
In the node management menu on the right side of the screen, select the server. You can check the details of the failure in the 'Resource Status' and 'Resource Dependency' screens.
- Normal termination of a system
  The user selected 'system shutdown' in the operating system.
- Abnormal termination of a system
  The system was terminated or rebooted because of an unexpected situation or a blue screen.
[Figure] Failure in Active Server
Because data cannot be replicated while the server is down, the failure is shown on the mirror disk resource.
The server operator checks the failure and returns the server to normal.
When the failed server has rebooted, check the mirror role of both servers, then set the failed server as the replication target and proceed with a partial resync.

Standby Server Failure

MCCS shows the failure when it occurs on the standby server.
Data replication is paused until the standby server is back to normal.

[Figure] Failure in Standby Server
If I/O continues, data cannot be replicated and the mirror disk is shown as 'Paused'.
If there is no I/O, the mirror disk icon does not change, but failure messages related to the mirror disk are written to the MCCS log.
A standby server failure does not affect operation. However, because there is no server to fail over to, the server operator must check the problem in the MCCS web console and make sure that the standby server is restored promptly.
When the standby server is back to normal, the mirror disk recovers from 'Paused' to 'Normal' and the icon disappears.

Application, Process and Windows Service Failure

Active application, process, and Windows service resources are governed by the four attributes below.

MonitorInterval (Default Value=10sec)
The resource is monitored at the interval set by this value.
MonitorTimeout (Default Value=10sec)
If there is no reply within the time set by this value, it is considered a failure.
RestartLimit (Default Value=0)
The application resource is restarted up to the number of times set by this value.
OnlineTrustTime (Default Value=600sec)
The time after which the resource's restart count is reset.
The attributes above are set when the resource is added, and users can check or change the values in the Resource Attribute view of the MCCS console.

[Figure] Resource attribute value Edit

MCCS monitors each resource periodically, at the interval set in 'MonitorInterval'.
If there is no response within the time set in 'MonitorTimeout', it is considered a failure.
If the resource still does not respond after the restart command has been sent the number of times set in 'RestartLimit', MCCS fails over the group that the resource belongs to.
If the resource stays in the normal state for the time set in 'OnlineTrustTime', MCCS resets the restart count that is tracked against 'RestartLimit'. This guarantees the full number of restart attempts the next time a failure occurs in the resource.
If a failover occurs because of a resource failure, the server operator checks the problem and returns the resource to normal.
The MCCS web console shows where the trouble occurred. After checking the affected area, the user must clear the trouble sign so that the failover function is enabled again.
When the failed server has rebooted, check the mirror role of both servers, then set the failed server as the replication target and proceed with a partial resync.

[Figure] Failure in Resource Clear
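
The interaction of the four monitoring attributes in this section can be illustrated with a minimal Python sketch. This is not MCCS code: the class names, the method name, and the "ok" / "restart" / "failover" return values are hypothetical, and the sketch assumes that a resource is restarted at most 'RestartLimit' times before the group is failed over.

```python
from dataclasses import dataclass


@dataclass
class ResourceAttributes:
    """Attribute names follow this section; the class itself is hypothetical."""
    restart_limit: int            # RestartLimit, number of restart attempts
    monitor_interval: int = 10    # MonitorInterval, seconds
    monitor_timeout: int = 10     # MonitorTimeout, seconds
    online_trust_time: int = 600  # OnlineTrustTime, seconds


class ResourceMonitor:
    """Illustrative model of the restart/failover decision (an assumption)."""

    def __init__(self, attrs):
        self.attrs = attrs
        self.restart_count = 0    # restarts used since the last reset
        self.healthy_seconds = 0  # time spent continuously in the normal state

    def on_monitor_result(self, response_seconds):
        """Process one monitor cycle.

        response_seconds is how long the resource took to answer,
        or None if it never answered.
        """
        timed_out = (response_seconds is None
                     or response_seconds > self.attrs.monitor_timeout)
        if not timed_out:
            # One more interval in the normal state; reset the restart
            # count once OnlineTrustTime has been reached.
            self.healthy_seconds += self.attrs.monitor_interval
            if self.healthy_seconds >= self.attrs.online_trust_time:
                self.restart_count = 0
            return "ok"
        self.healthy_seconds = 0
        if self.restart_count < self.attrs.restart_limit:
            self.restart_count += 1
            return "restart"
        return "failover"
```

With `restart_limit=2`, two consecutive timeouts each produce a restart and the third produces a failover; a sufficiently long healthy period in between makes both restart attempts available again.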

Network Failure

A network failure occurs when a network connection has a problem, such as a broken network switch or network interface card, a disconnected network cable, or a ping timeout on a network.

Warning
※ Because the MCCS license is tied to the MAC address, the license must be reissued if the network interface card is changed.

Service Network Failure
If a failure occurs in the service network of the active server, a fault mark is shown on the network interface card resource or the IP address resource of the node in the MCCS web console, and MCCS fails over to the standby server.

[Figure] Failure in Network Interface Card

In the MCCS web console, you can check which part of the service network has the problem.
On the server where the network failure occurred, MCCS checks whether the network cable is disconnected and whether a ping timeout occurs on the network.
If the IP address resource is the cause of the failure, the user should check the network switch or the network interface card.
When the physical network components are back to normal, select 'Clear Fault' in the MCCS console to remove the fault mark and re-enable the failover function.
If you want fault marks to be cleared automatically, enter a positive number in the AutoFaultClearTime group attribute.

Heartbeat Network Fault
The heartbeat should be redundant because it plays a critical role in synchronizing status between nodes and determining whether a failure has occurred. If one of the redundant heartbeat networks fails, the details of the failure are displayed in the log window.
However, nothing changes in the MCCS web console, because neither the operating server nor the standby server has a problem.
If a failure then occurs on the active server and a failover to the standby server is needed, MCCS uses the remaining healthy heartbeat network to fail over.
If all redundant heartbeat networks are disconnected, MCCS uses the service network as the heartbeat line.
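
The heartbeat fallback described above can be sketched as a small Python function. It is only a model of the stated behavior: the function name, the link names, and the returned labels are hypothetical and do not correspond to any MCCS interface.

```python
def evaluate_heartbeat(links, service_network_up):
    """Decide which network carries the heartbeat.

    links maps a heartbeat link name to True (up) or False (down).
    Returns (path, failed_links), where path is "heartbeat",
    "service network", or "none", and failed_links lists the links
    that would be reported in the log.
    """
    failed_links = [name for name, up in links.items() if not up]
    if len(failed_links) < len(links):
        # At least one dedicated heartbeat link is still healthy.
        path = "heartbeat"
    elif service_network_up:
        # All dedicated links are down: fall back to the service network.
        path = "service network"
    else:
        path = "none"
    return path, failed_links
```

For example, `evaluate_heartbeat({"hb1": False, "hb2": True}, True)` keeps the heartbeat on the dedicated network and reports only `hb1` as failed.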

...

A heartbeat failure can be checked in the MCCS log and the Windows system log. If a failure occurs in a heartbeat line, the server operator should check the server's TCP/IP settings and verify the physical heartbeat connection with a ping test.
If the result is abnormal, check the card and the cable for a bad connection or disconnection, and clear the cause of the failure.

Replication(Mirroring) Network Failure
When a failure occurs in the replication network, data cannot be replicated and the mirror disk resource is shown as 'Paused' in the MCCS web console.

[Figure] Failure in Replicated Network

The replication network failure history can be checked in the MCCS log and the Windows system log. If a failure occurs in the replication network, the server operator should check the server's TCP/IP settings and verify the physical replication network connection with a ping test.
If the result is abnormal, check the card and the cable for a bad connection or disconnection, and clear the cause of the failure.

Single Network Switch Fault
When the public network is configured with a single network switch and that switch fails, all resources on the active and standby servers are taken offline, and the resources where the failure occurred are shown as 'fault'.

[Figure] Failure in Network Switch

A network switch failure can be checked in the MCCS log and the Windows system log. If a failure occurs in the service network connection, the server operator should check the server's TCP/IP settings and verify the physical service network connection with a ping test.
If you want fault marks to be cleared automatically, enter a positive number in the AutoFaultClearTime group attribute.
For recovery from a network switch failure, request support from the manufacturer.

Disk Failure

...

Mirror Disk Failure

Source Disk Failure
If a failure occurs in the disk resource of the active server, the MCCS web console shows the failure.
Because the disk can no longer be read or written, MCCS fails over to the standby server.
[Figure] Failure in Mirror Disk

MCCS monitors disk availability as follows.
- Performs a periodic read/write test on the disk.
- Determines whether the disk has a drive letter.
Disk failure can be caused by the following.
- Disk controller problems or H/W problems, which should be fixed by the manufacturer.
- Physical disk problems or H/W problems, which should be fixed by the manufacturer.
After these issues are resolved, the OS detects the replaced disk again, and DataKeeper then proceeds with resynchronization.
If the mirror disk does not synchronize, delete the mirror disk resource and create it again. When you delete the resource, you must also delete the created mirror and create it again.
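
The two disk checks listed in this section (a read/write test and a drive letter check) can be approximated from outside MCCS with a short Python sketch. The function names and the test payload are hypothetical; how MCCS itself performs these checks is not documented here.

```python
import os
import tempfile


def disk_read_write_ok(directory, payload=b"disk-check"):
    """Write a small temporary file under directory, read it back, compare."""
    try:
        fd, path = tempfile.mkstemp(dir=directory)
    except OSError:
        # The directory is missing or the volume cannot be written.
        return False
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to the device
        with open(path, "rb") as handle:
            return handle.read() == payload
    except OSError:
        return False
    finally:
        # Always clean up the probe file if it still exists.
        try:
            os.remove(path)
        except OSError:
            pass


def drive_letter_present(letter):
    """Return True if the given drive letter (for example "F") exists.

    Only meaningful on Windows.
    """
    return os.path.exists(f"{letter}:\\")
```

A scheduled task could call `disk_read_write_ok("F:\\")` at a fixed interval and raise an alert when it returns False; the drive letter F is only an example.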

...

To detect a failure of the target disk, MCCS only determines whether the disk has a drive letter.
Disk failure can be caused by the following.
- Disk controller problems or H/W problems, which should be fixed by the manufacturer.
- Physical disk problems or H/W problems, which should be fixed by the manufacturer.
After these issues are resolved, the OS detects the replaced disk again, and DataKeeper then proceeds with resynchronization.
If the mirror disk does not synchronize, delete the mirror disk resource and create it again. When you delete the resource, you must also delete the created mirror and create it again.

Split Brain of Mirror Disk Resource
In rare cases, the mirror disk is identified as the source on both servers. This happens when, during a role change, the existing source cannot be changed to the target.
Both servers then try to synchronize the data, which causes the split brain. Split brain occurs in the situations shown below.

...

Check the resource attribute view.
[Figure] Verify SplitBrain of MirrorDisk

Check the mirror management view.

[Figure] Checking Mirror Disk Split Brains

Warning

1) Both nodes' MirrorRole is Source, and their MirrorState is MIRROR_PAUSED.
2) Check the mirror disk's TimeAquiredSourceRole. (TimeAquiredSourceRole is the system time, so it is not an absolute value for determining which node has the latest data.)
3) When a split brain occurs, the following log is displayed.
(Windows event error: An invalid attempt to establish a mirror occurred. Both systems were found to be Source.
Local Volume: F Remote system: 200.200.124.49 Remote Volume: F The mirror has been paused, or left in its current non-mirroring state.
Use the DataKeeper User Interface to resolve this Split Brain condition.)
4) In the mirror management window, the mirror condition is set to 'SPLIT'.
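
Condition 1) above can be expressed as a short check. The sketch below is illustrative only: it assumes the attribute values of each node have already been read into a dictionary (how they are obtained is outside its scope), and the function name is hypothetical.

```python
def is_split_brain(node_a, node_b):
    """Return True when both nodes report Source and MIRROR_PAUSED.

    Each argument is a dict holding the MirrorRole and MirrorState
    attribute values of one node.
    """
    nodes = (node_a, node_b)
    both_source = all(n.get("MirrorRole") == "Source" for n in nodes)
    both_paused = all(n.get("MirrorState") == "MIRROR_PAUSED" for n in nodes)
    return both_source and both_paused
```

TimeAquiredSourceRole is deliberately not used in the check because, as noted in 2), it is a system time and cannot prove which node holds the latest data.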

In the Group tab of the configuration tree, right-click the mirror disk resource. When you place the cursor on the "Resolve Split Brain" button, you can select the source node.

[Figure] Split Brain Resolving Selected
A window explaining split brain is displayed.

[Figure] Checking the Source Node Selection
Select the source node.

[Figure] Source Role Node Selection
Confirm the selected source node.

[Figure] Rechecking the Source Node Selection
The split brain is being resolved.

[Figure] Split Brain Resolved
Split brain resolution is finished.

[Figure] Resolving Split Brain Finished
The selected node becomes the source node and the mirror disk state changes to MIRRORING.

[Figure] Split Brain Resolved

Warning
All data that was changed on node B will be overwritten.

External Storage Failure

When an external disk fails or its connection path is bad, the disk cannot be read or written, so MCCS displays the failure and proceeds with a failover.

...

An external storage failure can be checked in the MCCS log and the system log.
If there is a problem in the external storage, the service is stopped until the storage recovers. The storage should therefore be recovered quickly or replaced with another one (backup storage).
Problems related to the external storage should be handled by the vendor.
When the server's connection to the external storage and the failed disk are back to normal, the server should be rebooted so that the MCCS kernel driver can identify the recovered environment.
Redundancy measures for the storage should also be arranged with the storage vendor.

NetBIOS Failures

Use Direct-Hosted SMB

SMB on Windows 2000 or later uses direct hosting. This feature provides the file sharing service directly, without the NetBIOS interface.
To resolve a name to an IP address, a DNS lookup is performed and NetBIOS name resolution is not used.

...

That is, if a DNS server is used together with the NetBIOS agent, most clients connect through Direct-Hosted SMB.

Related Cache Flush

When verifying the agent's behavior, flush the related caches.

...

Flush the ARP cache

Code Block
arp -d

Turn off firewall settings

Turn off firewall filtering for NetBIOS-related communication. The destination port numbers are:

...

Turn off firewall filtering for DNS server update and WINS server update communication. The destination port numbers are:

Panel
TCP/UDP 42, 53
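
After the firewall settings are changed, TCP reachability of these ports can be spot-checked from a client with a minimal Python sketch such as the one below. The function name is hypothetical, and the sketch covers TCP only; it says nothing about UDP traffic on the same ports.

```python
import socket


def tcp_port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered, timed out, or the host name did not resolve.
        return False
```

For example, calling `tcp_port_reachable(server, 53)` and `tcp_port_reachable(server, 42)` with the address of your own DNS or WINS server shows whether the TCP side of each port is open.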

Considerations when Workstation Service is stopped

The Workstation service in Windows Services creates and maintains client connections to remote servers using the SMB protocol.
If this service is stopped, these connections cannot be maintained. If this service is disabled, services that explicitly depend on it cannot be started.
Take care when the Workstation service is stopped.

Service Name
Alerter Service
Browser Service
Messenger Service
Net Logon Service
RPC Locator Service

Considerations when Server Service is stopped

The Windows Server service supports file, print, and named pipe sharing over the network for this computer. If this service is stopped, these features cannot be used. If it is disabled, the following services, which depend on it, cannot be started.

...

Service Name
Browser Service

When file sharing is not working

First, verify that file sharing works with the original NetBIOS computer name rather than the virtual name.
On the client, verify that the files on the node can be accessed consistently by using commands such as dir, start, explorer, or net view.

Verify with the DIR command.
The DIR command is run with the following syntax.

...

A list of the computer's file and print shares is displayed. If the specified computer has no file or print shares available, the message "There are no entries in the list" is displayed.
If the client does not refresh the mapping between the virtual name and the real IP address after a failover, the client cannot communicate for a few minutes, until its NetBIOS cache is flushed.
This can happen when a WINS server is used. In that case, the client program needs to be cluster-aware.

SCSI Lock Failure

When interlocking with a volume manager using SCSI3-PR

...

The reservation is canceled when a SCSI Lock agent is deleted. Before deleting it, keep in mind that the shared disk that was reserved can then be used by the other node. In other words, make sure the other node is down before you delete the agent.

Ways to collect support files

When a problem occurs in MCCS, collect the support files, which contain the logs and configuration information.
There are two ways to collect support files.

How to collect by using the web console

In the MCCS web console, click 'File' on the menu bar to collect support files.

[Figure] Support file Collect Icon 1
Support files can also be collected by clicking the toolbar icon shown in the figure below.

[Figure] Support File collect icon 2
You can select the node to collect support files from, or retrieve a previously collected support file.

[Figure] Support File Node Selection and Previous Support File Selection

Click the 'OK' button to collect the support file.

[Figure] Support Files Being Collected

Info
Collection may take several minutes, depending on the size of the log files and the network conditions.

Info

If the download window does not open in Internet Explorer

1. In the IE Internet Options, click Security -> Internet -> Custom Level.

2. In the Downloads section, enable 'Automatic prompting for file downloads' and 'File download'.

The file can then be downloaded from the download window, as shown below.

[Figure] Support Files Collection Checked

Collecting file using script files

The script file is located as below:

...
