
Failures can still occur after a redundant environment has been configured with MCCS.
This chapter explains how MCCS detects failures and how to administer the system after a failure or failover.
(In the following examples, the operating server is registered in MCCS as 'Active' and the standby server as 'Standby'.)

 


 

 

...

This is the case where the system reboots or shuts down because of a device conflict (NIC, RAID controller) or a kernel driver problem caused by another application.

Active Server Failure

  1. MCCS behaves the same way whether the server terminates normally or abnormally. When a failure occurs on the active server, MCCS fails over to the standby server.
    Select the server in the node management menu on the right side of the screen to check the failure in the 'Resource Status' and 'Resource Dependency' screens.
      • Normal termination of a system
        This is the case where a user selects 'system shutdown' in the operating system.
      • Abnormal termination of a system
        This is the case where the system is terminated or rebooted because of an unexpected situation or a blue screen.

    [Figure] Failure in Active Server

  2. Because data cannot be replicated while the server is down, the failure is shown on the mirror disk resource.
  3. The server operator checks the failure and restores the server to normal.
  4. When the failed server has rebooted, check the mirror role of both servers, then switch the failed server to the replication target and proceed with a partial resync.

...

  1. When a failure occurs on the standby server, MCCS displays the failure.
  2. Data replication is paused until the standby server is back to normal.


    [Figure] Failure in Standby Server

  3. Data synchronization becomes impossible, and the mirror disk enters the 'Network Connection Failure' state.
  4. A failure on the standby server does not affect operation. However, because there is no server to fail over to, the server operator must check the failure in the MCCS web console and restore the standby server promptly.
  5. When the standby server is back to normal, the failure icon disappears.
  6. Data synchronization is performed so that the mirror disk's DiskState value changes from 'Inconsistent' to 'UpToDate'.
  7. When synchronization is finished, the icon changes to show that current data is being synchronized in real time.

Application Failure

Active application resources are governed by the four attributes below.

...

  1. MCCS periodically monitors the resources at the interval set in 'MonitorInterval'.
  2. If there is no response within the time set in 'MonitorTimeout', the resource is considered to have failed.
  3. If there is still no response after the restart command has been sent the number of times set in 'RestartLimit', MCCS fails over the group to which the resource belongs.
  4. If the resource stays in the normal state for the time set in 'OnlineTrustTime', MCCS resets the 'RestartLimit' count. This guarantees the full number of restart attempts the next time the resource fails.
  5. If a failover occurs because of a resource failure, the server operator checks the problem and restores the resource to normal.
  6. The failed part is shown in the MCCS web console. After checking it, the user must clear the fault mark so that the failover function is enabled again.
    To have the fault mark cleared automatically, set AutoFaultClearTime in the group attributes to a value greater than 0.
  7. When the failed server has rebooted, check the mirror role of both servers, then switch the failed server to the replication target and proceed with a partial resync.


    [Figure] Failure in Resource Clear
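
The interaction of 'MonitorInterval', 'MonitorTimeout', 'RestartLimit', and 'OnlineTrustTime' described in the steps above can be sketched as a small simulation. This is an illustration only, not MCCS code: the `ResourcePolicy` class, the `simulate` function, and the exact order in which restarts and failover are decided are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ResourcePolicy:
    # Hypothetical container for the four attributes named in the text.
    monitor_interval: float   # seconds between probes ('MonitorInterval')
    monitor_timeout: float    # longest acceptable probe reply ('MonitorTimeout')
    restart_limit: int        # restart attempts before failover ('RestartLimit')
    online_trust_time: float  # healthy time that resets the count ('OnlineTrustTime')


def simulate(policy, probe_latencies):
    """Return the actions taken for a sequence of probe results.

    probe_latencies holds one reply time per probe; None means no reply.
    """
    actions = []
    restarts_used = 0
    healthy_for = 0.0
    for latency in probe_latencies:
        ok = latency is not None and latency <= policy.monitor_timeout
        if ok:
            healthy_for += policy.monitor_interval
            # Assumed rule: a long enough healthy period restores the restart budget.
            if restarts_used and healthy_for >= policy.online_trust_time:
                restarts_used = 0
                actions.append("reset")
            continue
        healthy_for = 0.0
        if restarts_used < policy.restart_limit:
            restarts_used += 1
            actions.append("restart")
        else:
            # Restart budget exhausted: the group is failed over.
            actions.append("failover")
            break
    return actions
```

With a policy of `ResourcePolicy(10, 5, 2, 30)`, three consecutive missed probes use up both restarts and then trigger a failover, while three healthy probes in between restore the restart budget first.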

...

    • Service Network Failure

      If a failure occurs in the service network of the active server, a fault mark is shown on the network interface card resource or IP address resource of that node in the MCCS UI, and MCCS fails over to the standby server.

      [Figure] Failure in Network Interface Card

 

  1. The MCCS web console shows which part of the service network has failed.
  2. MCCS checks whether the network cable of the affected server is disconnected and whether pings over the network time out.
  3. If the IP address resource is the cause of the failure, the user should check the network switch or network interface card.
    When the physical network components are back to normal, select 'Clear Fault' in the MCCS web console to remove the fault mark and re-enable the failover function.
  4. To have the fault mark cleared automatically, set AutoFaultClearTime in the group attributes to a value greater than 0.

    • Heartbeat Network Fault

      The heartbeat must be redundant because it plays the important role of synchronizing status between the nodes and determining failure conditions. If any one of the redundant heartbeat networks fails, the details of the failure are displayed in the log window.
      However, nothing changes in the MCCS web console, which means that neither the active server nor the standby server has a problem.
      If the active server then fails and a failover to the standby server is needed, MCCS uses the remaining healthy heartbeat network to fail over.

      If all redundant heartbeat networks are disconnected, MCCS uses the service network as the heartbeat line.

      [Figure] Failure in Heartbeat
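
The fallback order described above (a healthy heartbeat network first, the service network only when every heartbeat network is down) can be sketched as follows. The function name, the link names, and the dictionary input are hypothetical and chosen only for this example; how MCCS actually tracks link state is not shown in this chapter.

```python
def choose_heartbeat_path(heartbeat_links, service_network="service"):
    """Pick the path used for heartbeat traffic.

    heartbeat_links maps a link name to True (up) or False (down).
    Returns (chosen_path, failed_links); failed_links are the entries
    that would be reported in the log window.
    """
    failed = [name for name, is_up in heartbeat_links.items() if not is_up]
    for name, is_up in heartbeat_links.items():
        if is_up:
            return name, failed
    # Every dedicated heartbeat link is down: fall back to the service network.
    return service_network, failed
```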

...

    • Replication (Mirroring) Network Failure

      When the replication network fails, data cannot be replicated, and the mirror disk resource in the MCCS web console is displayed in the 'Disconnect' state.

      [Figure] Failure in Replication Network


...

    • Single Network Switch Fault

      In a configuration with a single network switch, when the switch connected to the public network fails, all resources on the active and standby servers are taken offline, and the resources where the failure occurred are shown as 'fault'.

      [Figure] Failure in Network Switch    

  1. A network switch failure can be checked in the MCCS log and the OS system log. If the failure is in the service network connection, the server operator should check the server's TCP/IP settings and verify the physical connection of the service network with a ping test.
  2. To have the fault mark cleared automatically, set AutoFaultClearTime in the group attributes to a value greater than 0.
  3. For recovery from a network switch failure, contact the switch manufacturer for support.
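
The ping test mentioned in step 1 can be scripted. The sketch below is one possible way to do it from Python and is not part of MCCS; the function names and the example address are assumptions. It relies on the operating system's own `ping` command, which takes `-n` for the packet count on Windows and `-c` on most other systems.

```python
import platform
import subprocess


def ping_command(host, count=3, system=None):
    """Build the ping command line for the given (or current) operating system."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def is_reachable(host, count=3):
    """Run ping and report whether it exited successfully.

    The exit code alone is not always conclusive (for example when a
    gateway answers on behalf of the target), so read the ping output
    as well when investigating a failure.
    """
    result = subprocess.run(
        ping_command(host, count),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For example, `is_reachable("192.0.2.10")` could be run against the service IP address of each node, where `192.0.2.10` is a placeholder address.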



...

  1. MCCS monitors disks in the following ways.
    • It periodically runs a read/write test on the disk.
    • It checks whether the disk's drive letter exists.
  2. A disk failure can have the following causes. After the problem is resolved, the OS detects the replaced disk again, and DRBD then proceeds with synchronization.
    • Disk controller problem: a problem in the hardware itself must be fixed by the manufacturer.
    • Physical disk problem: a problem in the hardware itself must be fixed by the manufacturer.
  3. If the mirror disk does not synchronize, delete the mirror disk resource and create it again. When deleting, remove not only the resource but also the created mirror, and then create both again.
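
The two disk checks listed in step 1 (a periodic read/write test and a drive letter check) can be approximated with a short script. This is a sketch of the idea, not the MCCS implementation; the function names, the probe payload, and the returned status strings are assumptions.

```python
import os
import tempfile


def drive_letter_exists(root):
    """Return True if the drive root (for example 'E:\\') is present."""
    return os.path.isdir(root)


def read_write_test(root, payload=b"disk-probe"):
    """Write a small temporary file under root, read it back, then delete it."""
    try:
        with tempfile.NamedTemporaryFile(dir=root, delete=False) as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to the device
            path = handle.name
        try:
            with open(path, "rb") as handle:
                return handle.read() == payload
        finally:
            os.remove(path)
    except OSError:
        return False


def check_disk(root):
    """Return 'missing', 'io-error', or 'ok' for the given drive root."""
    if not drive_letter_exists(root):
        return "missing"
    return "ok" if read_write_test(root) else "io-error"
```

A scheduler could call `check_disk` on the mirrored drive at a fixed interval and raise an alert on any result other than `'ok'`.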

...

  • Target Disk Failure
    If the disk on the standby server fails, the disk resource icon in the MCCS web console does not change, but the DiskState attribute value changes from UpToDate to Diskless. The service running on the source server is not affected.

    [Figure] Target Disk Failure Screen

...

  • Failure in Target Disk

  1. When MCCS detects a failure of the target disk, the failure is reflected only in the disk's DiskState value.
  2. A disk failure can have the following causes. After the problem is resolved, the OS detects the replaced disk again, and DRBD then proceeds with synchronization.
    • Disk controller problem: a problem in the hardware itself must be fixed by the manufacturer.
    • Physical disk problem: a problem in the hardware itself must be fixed by the manufacturer.
  3. If the mirror disk does not synchronize, delete the mirror disk resource and create it again. When deleting, remove not only the resource but also the created mirror, and then create both again.


...

  • Split Brain of Mirror Disk Resource 


...

    This is a rare case in which the mirror disk role is recognized as Primary on both servers and the data values shown in the web console do not match.
    It arises when the target is switched to source but the existing source fails to switch to target. Each node then tries to synchronize its own data, but because the earlier data does not match, synchronization does not happen automatically.
    A split brain can occur on the mirror disk as follows.
  1. A failover occurs because the source server (A) fails.
  2. The role of the target server (B) changes to Primary. (The mirror disk role changes.)
  3. The original source server (A) is rebooted.
  4. After the original source server (A) boots, the role of the target server (B) is checked.
  5. The GI values of both nodes are checked.
  6. If the GI data matches, data synchronization proceeds automatically. (In this scenario, the checks in steps 5 and 6 fail.)
  7. The GI data does not match, so one node must be resynchronized, and no automatic synchronization takes place. (A split brain has occurred.)

    When this state is reached, an icon is overlaid on the mirror disk resource in the MCCS web console, and the 'SplitBrainStatus' attribute value is set to true on both nodes.
    In this case, you must change the mirror disk role manually and then resynchronize.
    To change the mirror disk role manually, use the MCCS web console.

  • How to resolve a split brain by using the MCCS web console
  1. Check the resource attribute view.
    [Figure] Checking Mirror Disk Split Brain

  2. Check the mirror management view.

    [Figure] Checking Mirror Disk Split Brain

    Warning

    1) The ConnectState of both nodes is StandAlone, and the SplitBrainStatus value is True.
    2) Check LastMirrorOnlineTime on the mirror disk. (LastMirrorOnlineTime is taken from the system time, so it is not an absolute indicator of which node holds the latest data.)
    3) The log entry generated when a split brain occurs is displayed.
    (A split brain has occurred on DRBD volume (r0).)
    4) In the mirror management window, the mirror status is 'SPLIT'.

  3. Select the mirror disk, right-click it, and click 'Resolve Split Brains'.
    [Figure] Split Brain Resolving Selected

  4. A window explaining split brain is displayed.
    [Figure] Checking the Source Node Selection

  5. Select the source node.
    [Figure] Source Role Node Selection

  6. Confirm the selected source node again.
    [Figure] Rechecking the Source Node Selection

  7. The split brain is being resolved.
    [Figure] Split Brain Resolved

  8. Split brain resolution is finished.
    [Figure] Resolving Split Brain Finished

  9. The selected node becomes the source node, and the mirror disk's DiskState changes to UpToDate.
    [Figure] Split Brain Resolve

    Warning

    All changed data on node B will be overwritten.
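
The symptoms listed in the warning under step 2 (ConnectState is StandAlone on both nodes and SplitBrainStatus is True) can be expressed as a simple check. The sketch below only models those two attribute values; the `MirrorNodeState` class and the function name are hypothetical, and how the values are read from MCCS is outside the scope of this example.

```python
from dataclasses import dataclass


@dataclass
class MirrorNodeState:
    # Values as they appear in the resource attribute view.
    connect_state: str        # for example "StandAlone"
    split_brain_status: bool  # the SplitBrainStatus attribute


def looks_like_split_brain(node_a, node_b):
    """Return True when both nodes show the split-brain symptoms."""
    return all(
        node.connect_state == "StandAlone" and node.split_brain_status
        for node in (node_a, node_b)
    )
```

A positive result only indicates that manual resolution is needed. It does not say which node holds the latest data; that decision still has to be made by the operator when selecting the source node.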

     

External Storage Failure

When the external disk fails or its connection path is faulty, the disk cannot be read or written, so MCCS displays a fault mark and proceeds with a failover.

[Figure] Failure in Shared Disk

...

The SCSI Lock disk supports basic disks with a single drive letter. Do not use dynamic disks or disks with multiple volumes (one LUN divided into several partitions).

When the DUID remains unresolved after the agent is registered

You must first define the drive letter and request activation. Only then is the DUID information associated with that letter recorded in main.json.


When deleting the agent

The reservation is released when the SCSI Lock agent is deleted. Before deleting it, keep in mind that the reserved shared disk may then be used by the other node. In other words, shut down the other node before you delete the agent.



Ways to collect support files

When a problem occurs in MCCS, collect a support file to gather the log and configuration information.
There are two ways to collect a support file.

Collecting files using the web console

  1. In the MCCS web console, click 'File' on the menu bar to collect support files.

    [Figure] Collecting Support Files from Menu Bar

  2. Support files can also be collected by clicking the toolbar button shown in the figure below.

    [Figure] Collecting Support Files from Toolbar

  3. You can select the node to collect support files from and choose whether to download a previously collected support file again.

    [Figure] Support File Node Selection and Previous Support File Selection

  4. Click the 'OK' button to collect the support file.

    [Figure] Support Files Being Collected

    Info

    It may take several minutes depending on the log file size and the network condition.

  5. A download window opens, and you can download the file from it.

  6. The collected support files can be found in the designated location.

    [Figure] Support Files


Collecting files using script files

...

Info

If a support file already exists, the new file overwrites it.
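
Because an existing support file is overwritten, it can be useful to move the old file aside before collecting a new one. The sketch below shows one way to do that. It is not part of MCCS: the function name is hypothetical, and the path of the support file must be supplied by the operator because it is not stated here.

```python
import os
import shutil
import time


def preserve_existing(path, now=None):
    """Move an existing file to a timestamped name next to it.

    Returns the new path, or None when there was nothing to preserve.
    'now' is an optional epoch timestamp, mainly useful for testing.
    """
    if not os.path.exists(path):
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    base, ext = os.path.splitext(path)
    backup = f"{base}.{stamp}{ext}"
    shutil.move(path, backup)
    return backup
```

Calling `preserve_existing` on the support file path before running the collection script keeps the previous file under a name that includes the time at which it was moved.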