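After server redundancy is configured with MCCS, various failures can occur while the service is running.

This chapter uses the following examples to explain in detail how MCCS detects and handles failures.

(In the following examples, the active server is registered in MCCS as 'Active' and the standby server as 'Standby'.)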

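Using EMS (Emergency Message Service)

Through EMS (Emergency Message Service), a separate commercial product, MCCS automatically sends a text message to the server administrator and the MCCS product service representative when a serious error or failure occurs on the system.

EMS also provides a web-based integrated monitoring console, so the failure status can be seen at a glance from anywhere over the internet, and past failure history can easily be searched, managed, and reported on.

For details on using the EMS system, refer to the "EMS User Guide and EMS Agent Installation Guide".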

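EMS Components

EMS Agent

A program that runs on the server where MCCS is installed and communicates with the EMS server.

EMS Server

The server program installed at the company responsible for MCCS product maintenance.

EMS Workflow

Save Log

The EMS Agent saves logs.

The following LogType attribute values specify the type of log stored on the EMS server.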

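H

Saved as an HA-related log.

(Can be specified only for file monitoring.)

A

Saved as an application-related log.

(Can be specified only for file monitoring.)

S

Saved as a Windows system event log.

(Can be specified only for Windows event monitoring.)

P

Saved as a process-related log.

(Can be specified only for file monitoring.)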

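Log Analysis

EMS server users can set a failure level for each system that is to receive the EMS service.

The EMS server uses the configured failure level as a filter to analyze the system logs and MCCS logs that the EMS Agent sends in real time from the active server, and determines whether a failure has occurred.

SMS Notification

When a failure is detected by the configured filter, EMS notifies the system operator and the MCCS product service representative of the failure by SMS (text message) on their mobile phones so that they can respond quickly.

Analyzing the cause of failure after connecting to the EMS server

The system operator and the MCCS product service representative connect to the EMS server from anywhere with internet access, review the logs of the failed server, and analyze the cause of the failure.

In addition, manufacturing customers who deploy their own EMS server can centrally monitor the many redundant servers running in their production processes and use the statistics to search failure types and failure handling records by period.

The following diagram shows the workflow of the EMS system.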

Image Removed

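[Figure] Workflow of the EMS system

Consolidated monitoring screen of the EMS server

The following is part of the consolidated monitoring screen provided by the EMS server.

Servers with a failure are shown in red, servers whose failure the server administrator has acknowledged and that are being restored are shown in yellow, and servers shown in blue are in a normal state.

Users registered on the EMS server can view only the servers that they manage.

Image Removed

[Figure] Redundant server monitoring screen of the EMS system

Image Removed

[Figure] Statistics screen of the EMS system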

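Server Failure

This is the case where the system reboots or shuts down automatically because of a driver conflict in a device (NIC, RAID controller) or a kernel driver problem in another application.

Active Server Failure

...

Normal termination: The user selected 'Shut down' in the operating system.
Abnormal termination: The system shuts down or reboots because of a blue screen or another unexpected situation.

...

Standby Server Failure

...

Application Failure

For a running application resource, MCCS operates according to the following four attributes.

...

Network Failure

A network failure is a communication failure caused by a broken network switch or network card, a disconnected network cable, a ping timeout to a specific network, and so on.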

Warning
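※ Because the MCCS license references the MAC address, the license must be reissued if the network card is changed.

Service Network Failure

If a service network failure occurs on the active server, a fault is indicated on the network address or network card resource in the MCCS UI, and MCCS fails over to the standby server.

Image Removed
[Figure] Network card failure screen

...

Heartbeat Network Failure

The heartbeat must be redundant because it plays the important role of synchronizing state between nodes and determining failure status. If any one of the redundant heartbeat networks fails, the failure is shown in the log window.

However, nothing changes in the MCCS GUI. This means that neither the active server nor the standby server has a problem.

If the active server then fails and a failover to the standby server is required, MCCS performs the failover over the heartbeat network that is still healthy.

If all redundant heartbeats are disconnected, MCCS uses the service network for heartbeat communication.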

Image Removed

...

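Replication Network Failure

If the replication network fails, data replication cannot proceed, and the mirror disk resource in the MCCS console is shown in the 'Pause' state.

Image Removed
[Figure] Replication network failure screen

Replication network failures can be checked in the MCCS log and the Windows system log. When a replication network failure occurs, the server administrator must check the server's TCP/IP settings and physical connections and run a ping test to confirm whether the replication network is working normally.
If it is not, check for a faulty card, a bad cable connection, or a broken cable, and remove the cause of the failure.

Single Network Switch Failure

In an environment configured with a single network switch, if the switch connected to the Public Network fails, all group resources on the active and standby servers go offline, and the failed resources are shown in the 'Fault' state.

Image Removed
[Figure] Network switch failure screen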

...

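Disk Failure

Mirror Disk Failure (Internal Storage)

Source Disk Failure

If a failure occurs in the mirror disk resource of the active server, the failure is shown in the MCCS GUI. Because the disk can no longer be read or written, MCCS recognizes this as a failure and fails over to the standby server.

Image Removed

[Figure] Mirror disk failure screen

...

Performs a read/write on the disk.
Determines whether the disk is mounted.

...

Disk controller problem: A problem in the hardware itself must be resolved by the vendor.
Physical disk problem: A problem in the hardware itself must be resolved by the vendor.

...

Disk controller problem: A problem in the hardware itself must be resolved by the vendor.
Physical disk problem: A problem in the hardware itself must be resolved by the vendor.

...

Split Brain of Mirror Disk Resource
Although very rare, this is the case where the mirror disk role is recognized as source on both servers.
This happens when the existing source fails to change to target at the moment the target changes to source. Each server then tries to synchronize its own data, but the peer, which also considers itself the source, refuses the role change.
A split brain occurs on the mirror disk in the following situation.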

...

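How to resolve a split brain using the MCCS web console

Check the resource attribute window.
Image Removed
[Figure] Checking a mirror disk split brain

...

Check the mirror management window.

Image Removed
[Figure] Checking a mirror disk split brain

Warning

1) The ConnectState of both nodes is StandAlone, and the DiskState values are UpToDate and Outdated.

2) Check the LastMirrorOnlineTime of the mirror disk. (LastMirrorOnlineTime is the system time, so it is not an absolute value for deciding which data is the latest.)

3) The log generated when a split brain occurs is displayed.

(A split brain has occurred on DRBD volume (r0).)

4) The mirror state in the mirror management window is 'SPLIT'.

...

Warning
All changes made on node B will be overwritten.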

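External Storage Failure

If the connection path of an external disk or the disk itself fails, the disk cannot be read or written, so MCCS indicates the failure and performs a failover.

Image Removed

[Figure] Shared disk failure screen

External storage failures can be checked in the MCCS log and the system log.

If the external storage itself has a problem, server operation stops until the storage is recovered. The storage must therefore be recovered quickly or replaced with temporary (backup) storage.

For failures related to external storage, contact the storage vendor.

When the external storage connection and disk of the failed server are back to normal, the MCCS server must be rebooted so that the MCCS kernel driver can recognize the recovered environment.

A storage redundancy plan should also be arranged through the storage vendor.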

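SCSI Lock Failure

When integrating with a volume manager that uses SCSI3-PR

A volume manager (for example, a product such as Symantec SFW that uses the SCSI3-PR reservation function) cannot be used together with the SCSI Lock agent.

When checking whether SCSI3-PR is supported

Use the scsicmd.cmd command to check the PR type and confirm whether the disk supports the SCSI3-PR function.

When the path to sg_scan.exe or sg_persist.exe cannot be found

Check that the command exists in %MCCS_HOME%/bin.

When integrating with the shared disk agent

When integrating the shared disk agent with the SCSI Lock agent, register the SCSI Lock agent after confirming that the shared disk agent works normally.

The SCSI Lock agent uses the disk as a hardware lock device and does not use the disk's contents. The disk can therefore be small, and its contents are not protected.

When a registration key conflict error occurs

Remove all reservation keys or registration keys with the scsicmd.cmd -c option or the scsicmd.cmd -cf option, and then set them again. Before registering a resource, check whether a key is already registered and, if so, remove it before registering.

Note that the key is currently set automatically from the node's MAC address. When there are several network adapters, the MAC address of the first adapter is used. The key is recorded automatically in the configuration file. If a key already exists in the configuration file, a new key is not created.

When one disk has multiple drive letters and reserving one letter makes the remaining letters inaccessible

The SCSI Lock target disk supports a basic disk with a single drive letter. Do not use dynamic disks or disks with multiple volumes (multiple partitions configured on one LUN).

When the DUID remains unresolved after the agent is registered

The DUID information linked to the drive letter is recorded in main.json only after you define the drive letter and request activation.

When deleting the agent

The reservation is released when the SCSI Lock agent is deleted. Keep in mind that the reserved shared disk may then be used by the other node. In other words, bring the other node down before deleting the agent.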

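How to collect support files

When a problem occurs in MCCS, you can collect support files to gather logs and environment information.

There are two ways to collect support files.

Collecting through the console

Click the toolbar icon shown below in the MCCS console to collect support files.

Image Removed

[Figure] Support file collection icon 1

Image Removed

[Figure] Support file collection icon 2

You can select the nodes to collect support files from and choose to download a previously collected support file again.

Image Removed

[Figure] Support file node selection and previous support file option

Click the OK button to collect the support files.

Image Removed

[Figure] Support file collection in progress

Info
Depending on the log file size and the network condition, this may take several minutes.

A download window opens as shown below; download the file from there.

Image Removed

The collected support files can be found in the designated location.

Image Removed

[Figure] Checking the collected support files

Collecting by running the script file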

...

After server redundancy is configured with MCCS, various failures can occur while the service is running. This chapter uses the following examples to explain how MCCS detects failures and how to respond after a failure or failover. (In the examples, the active server is registered in MCCS as 'Active' and the standby server as 'Standby'.)


Using EMS (Emergency Message Service)

MCCS has a bundled product called EMS (Emergency Message Service) that automatically sends an SMS to the designated administrators when a critical event occurs.
In addition, because the console is web-based, failures can be monitored from anywhere with internet access, and past failure records can be searched, managed, and reported on easily.

EMS Components

EMS Agent

A program that runs on the server where MCCS is installed and communicates with the EMS server.

EMS Server

A server program installed at the company that provides and maintains the MCCS product.

EMS Workflow

Save Log

The EMS Agent saves logs.
The 'LogType' attribute specifies the type under which a log is stored on the EMS server, as shown below.

H

Saves the log as an HA (MCCS) log.
(Can be specified only for file monitoring.)

A

Saves the log as an application log.
(Can be specified only for file monitoring.)

S

Saves the log as a Windows system event log.
(Can be specified only for Windows event monitoring.)

P

Saves the log as a process log.
(Can be specified only for file monitoring.)
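
The LogType values above can be summarized as a small lookup table. The following is a minimal sketch, not part of EMS itself: the names `LOG_TYPES` and `is_valid_combination` and the monitor labels `"file"` and `"windows_event"` are hypothetical and only mirror the constraints listed above.

```python
# Hypothetical lookup table mirroring the LogType list above.
# Each entry holds a description and the only monitor kind that may
# use that LogType (the monitor labels are illustrative).
LOG_TYPES = {
    "H": ("HA (MCCS) log", "file"),
    "A": ("application log", "file"),
    "S": ("Windows system event log", "windows_event"),
    "P": ("process log", "file"),
}


def is_valid_combination(log_type: str, monitor: str) -> bool:
    """Return True if the monitor kind is allowed for the LogType."""
    entry = LOG_TYPES.get(log_type)
    return entry is not None and entry[1] == monitor
```

For example, `is_valid_combination("S", "file")` is rejected because S applies only to Windows event monitoring.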

Log Analysis

EMS server users can set a failure level for each system that receives the EMS service.
The EMS server uses the configured failure level as a filter to analyze the system logs and MCCS logs that the EMS Agent sends in real time from the active server, and determines whether a failure has occurred.
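
As a rough illustration of filtering by failure level, the sketch below keeps only the log records that reach a configured level. The numeric severity scale, the `LogRecord` type, and the `detect_failures` function are assumptions made for this example; the actual EMS failure levels and record format are not described here.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    host: str
    severity: int  # hypothetical scale: a higher number is more severe
    message: str


def detect_failures(records, failure_level):
    """Return the records whose severity reaches the configured level."""
    return [r for r in records if r.severity >= failure_level]


records = [
    LogRecord("Active", 1, "resource online"),
    LogRecord("Active", 3, "mirror disk disconnected"),
]
alerts = detect_failures(records, failure_level=3)
```

Each record left in `alerts` would then be a candidate for notification.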

SMS Notification

When a failure is detected by the configured filter, EMS sends an SMS describing the failure to the mobile phones of the system operator and the MCCS service operator so that they can respond quickly.

Analyzing the cause of failure after connecting to the EMS server

The system operator and the MCCS service operator can connect to the EMS server from anywhere with an internet connection, review the logs of the failed server, and analyze the cause.
In addition, manufacturing customers can centrally monitor all the servers in the factory and use the statistics to review failure types and troubleshooting records by period.
The following diagram shows the workflow of the EMS system.

Image Added

[Figure] Workflow of EMS System

Consolidated Web-based Dashboard of EMS Server

The following is part of the consolidated web-based dashboard of the EMS server.
Servers with a failure are shown in red, servers whose failure has been acknowledged by the server operator and that are being restored are shown in yellow, and servers operating normally are shown in blue.
Only users registered on the EMS server can monitor the dashboard.

Image Added

[Figure] Redundant server monitoring view of EMS system

Image Added

[Figure] Statistic view of EMS system

Server Failure

This is the case where the system reboots or shuts down automatically because of a driver conflict in a device (NIC, RAID controller) or a kernel driver problem in another application.

Active Server Failure

MCCS behaves the same way whether the server terminates normally or abnormally.
MCCS performs a failover to the standby server when the active server fails. In the node management menu on the right side of the screen, select the server.
You can check the details of the failure in the 'Resource Status' and 'Resource Dependency' screens.
- Normal termination of a system
  The user selected 'system shutdown' in the operating system.
- Abnormal termination of a system
  The system is terminated or rebooted because of a blue screen or another unexpected situation.
Image Added
[Figure] Failure in Active Server
Because data cannot be replicated while the server is down, a status icon (image not preserved) is shown on the mirror disk resource.
The server operator checks the failure and restores the server to normal.
After the failed server is rebooted, check the mirror role of both servers, set the failed server as the replication target, and proceed with a partial resync.

Standby Server Failure

MCCS shows the failure when a failure occurs on the standby server.
Data replication is paused until the standby server is back to normal.

Image Added
[Figure] Failure in Standby Server
Data synchronization cannot be performed. The mirror disk enters the 'Network Connection Failure' state (icon image not preserved).
Even if the standby server fails, operation is not affected. However, because there is no server to fail over to, the server operator must check the problem in the MCCS web console and restore the standby server promptly.
When the standby server is restored, the failure icon disappears.
The mirror disk's DiskState value changes from 'Inconsistent' to 'UpToDate'; to reach that state, data synchronization is performed (shown by a synchronization icon).
When the synchronization finishes, data is again replicated in real time and the icon changes accordingly.

Application Failure

MCCS handles a running application resource according to the following four attributes.

MonitorInterval (Default Value=10sec)
The resource is monitored at this interval.
MonitorTimeout (Default Value=10sec)
If there is no response within this time, it is considered a failure.
RestartLimit (Default Value=0)
The application resource is restarted up to this number of times.
OnlineTrustTime (Default Value=600sec)
The time after which the resource's restart count is reset.
The attributes above are set when the resource is added, and users can check or change the values in the Resource Attribute view of the MCCS console.

Image Added
[Figure] Resource attribute value Edit

MCCS periodically monitors the resource at the interval set in 'MonitorInterval'.
If there is no response within the time set in 'MonitorTimeout', it is considered a failure.
If the resource still does not respond after MCCS has sent the restart command the number of times set in 'RestartLimit', MCCS fails over the group that the resource belongs to.
If the resource stays in a normal state for the time set in 'OnlineTrustTime', MCCS resets the restart count tracked against 'RestartLimit'. This ensures that the full number of restarts is available the next time the resource fails.
If a failover occurs because of a resource failure, the server operator checks the problem and restores the resource to normal.
In the MCCS web console, a user can see where the failure occurred. After checking it, the user must clear the fault mark so that the failover function is enabled again.
If you want the fault mark to be cleared automatically, enter a positive number in AutoFaultClearTime of the group attribute.
After the failed server is rebooted, check the mirror role of both servers, set the failed server as the replication target, and proceed with a partial resync.

Image Added
[Figure] Failure in Resource Clear
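
The interaction between RestartLimit and OnlineTrustTime described in this section can be sketched as a small state tracker. This is an illustrative model only, not MCCS code: the class names `ResourcePolicy` and `RestartTracker` and their methods are hypothetical, and the defaults are the attribute defaults listed above.

```python
from dataclasses import dataclass


@dataclass
class ResourcePolicy:
    monitor_interval: int = 10    # seconds between checks
    monitor_timeout: int = 10     # seconds to wait for a response
    restart_limit: int = 0        # restarts allowed before failover
    online_trust_time: int = 600  # healthy seconds before the count resets


class RestartTracker:
    """Tracks restart attempts for one resource (illustrative only)."""

    def __init__(self, policy: ResourcePolicy):
        self.policy = policy
        self.restarts = 0

    def on_failure(self) -> str:
        """Return 'restart' while attempts remain, else 'failover'."""
        if self.restarts < self.policy.restart_limit:
            self.restarts += 1
            return "restart"
        return "failover"

    def on_healthy(self, seconds_online: int) -> None:
        """Reset the restart count once the resource has stayed healthy."""
        if seconds_online >= self.policy.online_trust_time:
            self.restarts = 0
```

With the default RestartLimit of 0, the first detected failure in this model leads directly to a failover.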

Network Failure

A network failure is a communication failure caused by a broken network switch or network interface card, a disconnected network cable, a ping timeout to a specific network, and so on.

Warning
※ Because the MCCS license references the MAC address, the license must be reissued if the network interface card is changed.

Service Network Failure
If a failure occurs in the service network of the active server, a fault mark is shown on the IP address resource or network interface card resource in the MCCS UI, and MCCS fails over to the standby server.
Image Added
[Figure] Failure in Network Interface Card

In the MCCS web console, you can check which part of the service network has failed.
MCCS checks whether the network cable of the affected server is disconnected and whether a ping timeout occurs on the network.
If the IP address resource is the cause of the failure, check the network switch or network interface card.
When the physical network components are back to normal, select 'Clear Fault' in the MCCS web console to remove the fault mark and re-enable the failover function.
If you want the fault mark to be cleared automatically, enter a positive number in AutoFaultClearTime of the group attribute.

- Heartbeat Network Fault
  The heartbeat must be redundant because it synchronizes status between nodes and determines failure conditions. If any one of the redundant heartbeat networks fails, the details of the failure are displayed in the log window.
  However, nothing changes in the MCCS web console, because neither the active server nor the standby server has a problem.
  If the active server then fails and a failover to the standby server is needed, MCCS uses the remaining healthy heartbeat network to perform the failover.
  If all redundant heartbeat networks are disconnected, MCCS uses the service network for heartbeat communication.
  Image Added
  [Figure] Failure in Heartbeat

A heartbeat failure can be checked in the MCCS log and the Windows system log. If a failure occurs on a heartbeat line, the server operator should check the server's TCP/IP settings and physical connections, and run a ping test to verify that the heartbeat network is working.
If it is not working normally, check for a faulty card, a bad cable connection, or a broken cable, and remove the cause of the failure.
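
The heartbeat fallback order described above (any healthy heartbeat link first, the service network only when every heartbeat link is down) can be sketched as follows. The function name `pick_heartbeat_path` and the link names are hypothetical; this is a simplified model, not the MCCS implementation.

```python
def pick_heartbeat_path(heartbeat_links, service_link="service"):
    """Return the first healthy heartbeat link, else the service network.

    heartbeat_links maps a link name to True (healthy) or False (failed).
    """
    for name, healthy in heartbeat_links.items():
        if healthy:
            return name
    return service_link
```

For instance, `pick_heartbeat_path({"hb1": False, "hb2": True})` selects `hb2`, and the service network is returned only when both entries are False.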

- Replication (Mirroring) Network Failure
  When the replication network fails, data cannot be replicated. The mirror disk resource in the MCCS web console displays 'Disconnect' (icon image not preserved).
  Image Added
  [Figure] Failure in Replication Network

A replication network failure can be checked in the MCCS log and the OS system log. If a failure occurs in the replication network, the server operator should check the server's TCP/IP settings and physical connections, and run a ping test to verify that the replication network is working.
If it is not working normally, check for a faulty card, a bad cable connection, or a broken cable, and remove the cause of the failure.

- Single Network Switch Fault
  In an environment configured with a single network switch, when the switch connected to the Public Network fails, all group resources on the active and standby servers are taken offline, and the resources where the failure occurred are shown as 'fault'.
  Image Added
  [Figure] Failure in Network Switch

A network switch failure can be checked in the MCCS log and the OS system log. If a failure occurs in the service network connection, the server operator should check the server's TCP/IP settings and physical connections, and run a ping test to verify that the service network is working.
If you want the fault mark to be cleared automatically, enter a positive number in AutoFaultClearTime of the group attribute.
For recovery from a network switch failure, contact the switch manufacturer for support.

Disk Failure

Mirror Disk Failure

Source Disk Failure
If a failure occurs in the mirror disk resource of the active server, the MCCS GUI shows the failure. Because the disk can no longer be read or written, MCCS recognizes this as a failure and fails over to the standby server.
Image Added
[Figure] Failure in Mirror Disk

MCCS monitors disk availability as follows.
- Performs a periodic read/write test on the disk.
- Determines whether a drive letter exists on the disk.
Disk failure can be caused by the following.
- Disk controller problem: a problem in the hardware itself must be fixed by the manufacturer.
- Physical disk problem: a problem in the hardware itself must be fixed by the manufacturer.
After these issues are resolved, the OS detects the replaced disk again, and DRBD then proceeds with synchronization.
If the mirror disk does not synchronize, delete the mirror disk resource and create it again. When you delete the resource, you must also delete the existing mirror and create it again.
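
The two monitoring checks described in this section (a read/write test and a check that the volume is present) can be approximated with a small probe. This is a sketch under stated assumptions, not how MCCS itself monitors disks: the function name `disk_is_writable` is hypothetical, and a directory-existence test stands in for the drive-letter check.

```python
import os
import tempfile


def disk_is_writable(path: str) -> bool:
    """Write a small probe file under path, read it back, then delete it."""
    if not os.path.isdir(path):  # stands in for the drive-letter check
        return False
    probe = b"mccs-probe"
    try:
        fd, name = tempfile.mkstemp(dir=path)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(probe)
                f.flush()
                os.fsync(f.fileno())  # push the data to the device
            with open(name, "rb") as f:
                return f.read() == probe
        finally:
            os.remove(name)
    except OSError:
        return False
```

A monitor could call this function once per monitoring interval and treat a False result as a disk fault.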

Target Disk Failure
If the disk on the standby server fails, the disk resource icon in the MCCS web console does not change, but the DiskState attribute value changes from UpToDate to Diskless. The service running on the source server is not affected.
Image Added
[Figure] Failure in Target Disk

When MCCS detects a failure of the target disk, the failure is indicated only by the disk's DiskState value.
Disk failure can be caused by the following.
- Disk controller problem: a problem in the hardware itself must be fixed by the manufacturer.
- Physical disk problem: a problem in the hardware itself must be fixed by the manufacturer.
After these issues are resolved, the OS detects the replaced disk again, and DRBD then proceeds with synchronization.
If the mirror disk does not synchronize, delete the mirror disk resource and create it again. When you delete the resource, you must also delete the existing mirror and create it again.

Split Brain of Mirror Disk Resource
This is a rare case in which the mirror disk role on both servers is recognized as primary and the data shown in the web console does not match.
It happens when the existing source fails to switch to target at the moment the target switches to source. Each server then tries to synchronize its own data, but automatic synchronization fails because the data on the two sides no longer matches.
A split brain can occur on the mirror disk as follows.

1) A failover occurs because the source server (A) fails.
2) The role of the target server (B) changes to Primary. (Mirror disk role changed)
3) The original source server (A) is rebooted.
4) After the original source server (A) boots, the role of the target server (B) is checked.
5) The GI values of both nodes are checked.
6) If the GI data matches, data synchronization proceeds automatically. (The checks in steps 5 and 6 fail.)
7) The GI data does not match, so one node must be resynchronized. No automatic synchronization takes place. (A split brain has occurred.)

When this state is reached, the mirror disk resource icon in the MCCS web console is shown with an overlay (icon image not preserved), and the 'SplitBrainStatus' attribute value is set to true on both nodes.
In this case, you must manually change the mirror disk role and then resynchronize.
To change the mirror disk role manually, use the MCCS web console.
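
The decision in the steps above (synchronize automatically when the GI values match, otherwise report a split brain) can be modeled in a few lines. This is a deliberately simplified sketch: real DRBD generation identifier comparison has more cases, the GI is treated here as an opaque string, and `MirrorNode` and `classify` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MirrorNode:
    name: str
    gi: str  # generation identifier, treated as an opaque string


def classify(a: MirrorNode, b: MirrorNode) -> str:
    """Return 'auto-sync' if the GI values match, else 'split-brain'."""
    if a.gi == b.gi:
        return "auto-sync"
    return "split-brain"
```

In the split-brain case the model stops there, which matches the text: the administrator must choose the source node manually.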

How to resolve the split brain issues by using the MCCS web console

Check the resource attribute view.
Image Added
[Figure] Verify SplitBrain of MirrorDisk

Check the mirror management view.

Image Added
[Figure] Checking Mirror Disk Split Brains

Warning

1) The ConnectState of both nodes is StandAlone and the SplitBrainStatus values are set to True.
2) Check LastMirrorOnlineTime on the mirror disk. (LastMirrorOnlineTime is the system time, so it is not an absolute value for determining which data is the latest.)
3) The log generated when a split brain occurs is displayed.
(DRBD volume (r0) has a split brain.)
4) In the mirror management window, the mirror state is 'SPLIT'.

Select the mirror disk, right-click it, and click 'Resolve Split Brains'.
Image Added
[Figure] Split Brain Resolving Selected
A window explaining split brain is displayed.
Image Added
[Figure] Checking the Source Node Selection
Select the source node.
Image Added
[Figure] Source Role Node Selection
Recheck the selected source node.
Image Added
[Figure] Rechecking the Source Node Selection
The split brain is being resolved.
Image Added
[Figure] Split Brain Resolved
Split brain resolution is finished.
Image Added
[Figure] Resolving Split Brain Finished
The selected node becomes the source node and the mirror disk's DiskState changes to UpToDate.
Image Added
[Figure] Split Brain Resolve

Warning
All changes made on node B will be overwritten.

External Storage Failure

When an external disk or its connection path fails, the disk cannot be read or written, so MCCS displays the failure and proceeds with a failover.

Image Added

[Figure] Failure in Shared Disk

An external storage failure can be checked in the MCCS log and the system log.
If the external storage itself has a problem, service is stopped until the storage is recovered. The storage must therefore be recovered quickly or replaced with temporary (backup) storage.
For problems related to the external storage, contact the storage vendor.
When the external storage connection and disk of the failed server are back to normal, the server must be rebooted so that the MCCS kernel driver can recognize the recovered environment.
A storage redundancy plan should also be arranged through the storage vendor.

SCSI Lock Failure

When integrating with a volume manager that uses SCSI3-PR

A volume manager that uses the SCSI3-PR reservation function (for example, Symantec SFW) cannot be used together with the SCSI Lock agent.

When checking whether SCSI3-PR is supported

To check whether the disk supports the SCSI3-PR function, check the PR type with the scsicmd.cmd command.

When integrating with the shared disk agent

When integrating the shared disk agent with the SCSI Lock agent, confirm that the shared disk agent works normally and then register the SCSI Lock agent.
The SCSI Lock agent uses the disk as a hardware lock device and does not use the disk's contents. The disk can therefore be small, and its contents are not protected.

When a registration key conflict error occurs

Remove the reservation key and registration key with the scsicmd.cmd -c command and set them again. Before registering a resource, check whether a key is already registered and, if so, remove it before registering.
Note that the current key is set automatically from the node's MAC address. When there are multiple network adapters, the MAC address of the first adapter is used. The key is recorded automatically in the settings file. If a key already exists in the settings file, a new key is not created.
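
The key handling described above can be sketched as follows, assuming that a key already stored in the settings file is reused rather than regenerated. The functions `key_from_mac` and `resolve_key`, the `"key"` field, and the hexadecimal key format are hypothetical; the actual key format used by the SCSI Lock agent is not documented here.

```python
def key_from_mac(mac: str) -> str:
    """Turn a MAC address such as '00-1A-2B-3C-4D-5E' into a hex string."""
    return mac.replace("-", "").replace(":", "").lower()


def resolve_key(config: dict, adapter_macs: list) -> str:
    """Reuse the stored key if present; otherwise derive one and store it."""
    if config.get("key"):
        return config["key"]
    config["key"] = key_from_mac(adapter_macs[0])  # first adapter only
    return config["key"]
```

In this model, replacing the first network adapter does not change the key while the stored value remains in the settings.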

When one disk has multiple drive letters and reserving one letter makes the other letters inaccessible

The SCSI Lock disk supports only a single disk device. Do not use a disk with multiple volumes (multiple partitions configured on one LUN).

When the DUID remains unresolved after the agent is registered

The DUID information linked to the drive letter is recorded in main.json only after you define the disk device and request activation.

When deleting the agent

The reservation is released when the SCSI Lock agent is deleted. Because the shared disk that was reserved could then be used by the other node, make sure the other node is down before you delete the agent.

How to collect support files

When a problem occurs in MCCS, you can collect support files to gather logs and environment information.
There are two ways to collect support files.

Collecting by using the web console

In the MCCS web console, click 'File' on the menu bar to collect support files.
Image Added
[Figure] Collecting Support Files from Menu Bar
You can also collect support files by clicking the toolbar icon shown in the figure below.
Image Added
[Figure] Collecting Support Files from Toolbar
You can select the node to collect support files from and choose to get a previously collected support file again.
Image Added
[Figure] Support File Node Selection and Previous Support File Selection
Click the 'OK' button to collect the support files.
Image Added
[Figure] Support Files Being Collected

Info
It may take several minutes depending on the log file capacity and the network condition.
The collected support files can be checked in the designated location.
Image Added
[Figure] Support Files

Collecting by running the script file

The script file is located at:

Code Block
$MCCS_HOME/bin/Support/support.cmd

Info
This method collects information only from the node where the script is run.

The collected support file is created in the following location.

Code Block
$MCCS_HOME/support-$HOSTNAME/$HOSTNAME.zip

Info
If a support file already exists, the new file overwrites it, so take care before collecting.
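
Because a new support file overwrites an existing one, it can be useful to check for an existing file before running the script. The sketch below builds the path from the pattern shown above; the function names `support_file_path` and `existing_support_file` are hypothetical helpers, not part of MCCS.

```python
import os
import socket


def support_file_path(mccs_home: str, hostname: str) -> str:
    """Build <MCCS_HOME>/support-<HOSTNAME>/<HOSTNAME>.zip."""
    return os.path.join(mccs_home, f"support-{hostname}", f"{hostname}.zip")


def existing_support_file(mccs_home: str, hostname: str = ""):
    """Return the path if a support file already exists, else None."""
    path = support_file_path(mccs_home, hostname or socket.gethostname())
    return path if os.path.exists(path) else None
```

If `existing_support_file` returns a path, the file can be copied elsewhere before the script is run again.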

Version	Old Version 7	New Version Current
Changes made by	권홍선	정은진
Saved on	2014/07/21	2015/05/29
