Disk Error

Disk failure (error) refers to a situation in which disk I / O fails due to an unexpected failure such as disconnection of the physical connection terminal of the disk storage layer, media corruption, bad sector, or SCSI error. These disorders occur temporarily and then become normal and sometimes lead to permanent disorders. bsr categorizes these disorders as temporary and permanent, and treats them differently depending on the type.

Temporary failure is a situation where, for some reason at the storage layer, an error occurs briefly and then normalizes again. In this case, since it is not a serious situation that requires a replacement, it is efficient to keep the service running as much as possible to resolve the errors that have occurred and to keep the replication running. That is, in a temporary error situation, the block area where the I/O error occurred is recorded as out-of-sync, and if the I/O retried with the block succeeds, the OOS is naturally resolved.

Permanent disk failures require recovery from disk failures, such as disk replacement, and must be repaired through a full rebuild configuration procedure.

I/O error handling policy is set through the on-io-error option in the resource <disk> section.

resource <resource> {
  disk {
    on-io-error <strategy>;
    ...
  }
  ...
}

If the option is defined in the <common> section, it applies to all resources.

<strategy> has three options:

passthrough The default value of on-io-error is to report I/O errors to the upper layer. I/O errors are sent to the mounted file system in the case of the primary node, and error results to the primary node, but in the secondary node errors are ignored (if there is no replication connection). At this time, the disk status is maintained and the error block is recorded in OOS.
detach When a lower-level I/O error occurs, the node detaches the backup device from the replication volume and switches to diskless state.
call-local-io-error. Call the command defined by the local I/O error handler. This option is only available if local-io-error <cmd> is defined in the <handlers> section of the resource. Using the local-io-error call command or script, I/O error handling is entirely at the user's discretion.

The error handling policy can be applied in real time through the adjust command even if the resource is in operation.

Disk failures occur more often than expected, based on experience in operating replication services. These results depend on the sub-disk layer. This means that errors in the disk layer, that is, the standard SCSI layer, can occur at any time and at any time, which means that they must be handled separately from the stability of the disk layer and be flexible in terms of replication.

The detach policy, which has been provided as a disk failure policy, was a policy where replication was unilaterally stopped at a specific point in time in terms of service operation. This method is also difficult to post-recovery and disadvantageous in terms of continuing service operation. We devised a passthrough policy to solve this problem and set it as the default policy for bsr.

The pass-through policy records the OOS for the block when an I/O error occurs and delivers the failed I/O result to the file system. At this time, if the file system succeeds by retrying to write to the block where the error occurred and resolves the OOS through this, it will lead to the temporary overcoming of the disk layer error. Although OOS cannot be completely resolved depending on the operating characteristics of the file system, some remaining OOS can be resolved by resynchronizing through connection retries. In other words, the pass-through policy induces the Filesystem to resolve the error block by itself or through synchronization, and basically ensures that the service continues to operate even if there is a problem with the disk I/O.

Temporary error handling

If the I/O error policy is set to passthough, the result of the I/O error is transferred to the file system, and the bsr records the I/O block in error as OOS. If the I/O error was temporary, the OOS is retried by the file system and resolved by itself. Otherwise, the OOS area is resolved by inducing resynchronization through replication reconnection. At this time, it is left to the administrator or the HA operation logic to manage operations such as the direction of synchronization to resolve the OOS.

bsr cannot determine if an I/O error is temporary or permanent, and it does not collect specific error statistics such as sector-by-sector I/O error tracking. Instead, only the number of I/O error occurrence information for the device in which the error occurred is maintained. Although the specific information is not shown, the degree of error can be determined based on the number of I/O errors in bsradm status. If a temporary error occurs, the io error and oos shown in status will only be recorded in a small amount and will not increase over time.

Permanent error handling

All disk failure situations are considered as permanent errors except those in which some OOS is recorded due to temporary I/O errors.

Permanent errors can occur in a variety of situations. For example, if the physical damage (bad block) to the disk occurs, cable of storage is disconnected, the volume is removed, and the SCSI controller has failed. This includes non-replicating situations in which the volume is removed due to the mistake of administrator or other programs that are not compatible with bsr.

If this permanent error occurs, the administrator must reconfigure replication or replace the disk. To fix a permanent disk failure, you must first down the resource and then detach the disk from the replication state. Of course, you must reinitialize the metadisk for any replication resource that needs reconfiguration.

It is possible to automatically perform resource detaching by using detach policy, but it is recommended to set the passthrough policy rather than the detach. The passthrough policy is more reasonable in terms of service operation as it can actively respond to temporary failures on disk.

Disk replacement

Regenerate the meta data set and reconnect the resources as follows. If necessary, perform a full synchronization by explicitly executing the synchronization command.

C:\Program Files\bsr\bin>bsradm down <resource>
  
C:\Program Files\bsr\bin>bsradm create-md <resource>
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initializing bitmap
New bsr meta data block sucessfully created.
 
 
C:\Program Files\bsr\bin>bsradm up <resource>
C:\Program Files\bsr\bin>bsradm invalidate <resource>

Etc

For disk read errors and meta disk I/O errors, automatic detach policy is applied to stop replication, and resources that have been stopped can be recovered only through reconfiguration.

Node failure

If the node goes down due to hardware failure or manual intervention, the other node detects the situation, switches the connection state to Connecting, and waits in disconnected mode until the other node reappears. Node failures are usually treated as temporary failures that are self-recovering, but must be manually recovered in the permanent failure.

Secondary failure

If the target side goes down, the source side can continue operating in disconnected mode, but modifications to the block will not be replicated to the other node. However, the information on the block being changed is stored internally and waits for synchronization with the changed block when the counterpart node is restored later.

Primary failure

During replication, the primary node may stop unexpectedly due to power failure. This state is called the Crashed Primary state. At this time, the other node detects that the primary node has disappeared and switches to disconneted mode. The crashed primary node is restarted through the booting process and started in an automatic secondary role by the service. It does not automatically promote the Primary role here. The Crashed Primary node establishes a connection with the Secondary node, and in the process, both nodes recognize each other that they are in the Crashed Primary state. After that, the consistency of both nodes is matched through the process of synchronization. in this case, when the failed node recovers, manual intervention is no longer required, but the cluster management application (HA) can determine the promotion of a particular node and continue its operational services.

If the source side goes down, HA must first decide whether to switch over. Depending on whether or not the switchover is made, it is determined by which node to synchronize when the source node is later restored and joined to the cluster.

When switching, the node that was the standby node acts as the new source node.
If you wait without switching, the node that was the original source node acts as the synchronization source node.

In case of primary node failure, special mechanism through AL will match the consistency of the block device. Please refer to the activity log of BSR Internals for details on this.

Permanent failure

In the event of a permanent or non-recoverable failure, you must perform the following steps.

Replace failed hardware with hardware of similar performance and disk capacity.

Note that if the target is replaced with a disk with a smaller capacity than the existing one, the replication connection is rejected.

Install basic systems and applications.
Install bsr and copy the files under /etc/bsr.d/ on the running node and /etc/bsr.conf.
It reconfigures according to the resource reconfiguration procedure, starts the resource, and starts the initial synchronization.

If Primary is already running, synchronization starts automatically when connected, so there is no need to synchronize manually.

Split-brain

Split-brain behavior settings

bsr provides a way to automatically notify the operator when a split brain is detected.

Split-brain notification

If you want to detect when a split brain occurs, you can configure a split-brain handler to detect it. The following is an example of resource configuration.

resource <resource>
  handlers {
    split-brain <handler>;
    ...
  }
  ...
}

<handler> is composed of executable modules such as scripts on the system. It is usually recommended to configure it in an executable form such as a batch file or shell script.

notice

Since the return result of the split-brain handler is synchronously linked with the bsr kernel engine, if the execution result of the handler is not returned, it may affect the kernel engine. We recommend that you configure it as a simple, uncomplicated script.

On Linux, split brain handler is enabled by default, but on Windows it is disabled by default. You can activate the handler service using the following command:

drbdcon /handler_use 1

수동 복구

복제 연결 단절 후 양 노드가 Primary 역할이었다는 것을 감지했다면 상대 노드와 재 연결하는 시점에 스플릿 브레인(Split-brain)으로 판단하고 즉시 복제 연결을 끊습니다. 그리고 로그에 다음과 같은 메시지를 남깁니다.

Split-Brain detected, dropping connection!

하나의 노드에서 스플릿 브레인이 감지되면 리소스는 StandAlone 상태가 됩니다. 양 노드가 동시에 스플릿 브레인을 감지했다면 양 노드 모두 StandAlone 상태가 되지만, 한 노드에서 스플릿 브레인을 감지하기 전에 상대 노드가 연결을 먼저 끊게 되면 SB를 감지하지 못한 채 단지 Connecting 상태로 유지합니다.

스플릿 브레인을 복구하기 위해선 스플릿 브레인 자동 복구 구성이 되어 있지 않는 한, 변경사항을 폐기할 노드(victim)를 먼저 선택하고 다음과 같은 절차를 통해 수동으로 복구해야 합니다.

먼저 양 노드에서 다음의 명령을 통해 StandAlone 상태로 만듭니다.

bsradm disconnect <resource>

폐기할 노드(victim)에서 아래 명령을 수행합니다.

bsradm secondary <resource>
bsradm connect --discard-my-data <resource>

운영 노드(survivor) 에서 연결합니다. 노드가 이미 Connecting 상태였다면 자동으로 연결되기 때문에 이 단계를 생략할 수 있습니다.

bsradm connect <resource>

정상적으로 연결되면 victim의 복제 상태는 즉시 SyncTarget으로 변경되고, 노드 간의 데이터 변경점들이 survivor 노드를 기준으로 동기화 됩니다.

스플릿 브레인의 폐기 노드(victim)의 데이터는 전체 디바이스 차원에서 동기화가 이루어지지 않습니다. 단지 victim의 수정사항을 롤백하고, 모든 수정사항에 대해 스플릿 브레인 survivor에서 victim노드로 전파됩니다. 즉 변경 분에 대해서만 동기화가 이루어 집니다.

재동기화가 완료되면 스플릿 브레인은 해결된 것으로 간주하며, 두 노드는 다시 완전히 일치된 이중화 복제 저장 시스템으로 복구됩니다.

다중 스플릿 브레인

1:N 복제 환경에선 노드들 간의 스플릿 브레인이 다중으로 발생될 수 있습니다.

예를 들어 A-B-C 의 1:2 복제 구성의 경우 각 노드의 연결을 단절하고 모든 노드의 리소스를 승격 후 볼륨에 I/O 를 수행합니다. 이 후 각 노드를 강등하여 재 연결하면 각각의 노드간에 스플릿 브레인이 발생됩니다. 이 때 스플릿 브레인은 A-B 노드간, B-C 노드간에 또는 A-C 노드 간의 스플릿 브레인 상황이며 클러스터 내 노드간의 SB가 다중으로 발생된 상황입니다. 이럴 경우 다중 SB의 해결은 노드들 간의 개별 연결 기준으로 다음의 절차에 따라 순차적으로 해결해야 합니다.

Survival 노드와 Victim 노드를 결정하고 모든 노드에서 disconnect 하여 StandAlone 상태를 만듭니다.
SB가 발생한 노드간에 SB를 해결합니다.
- Survival 노드는 drbdsetup connect [resource] [victim node-id] 명령을 통해 Connecting 상태로 진입
- Victim 노드는 drbdsetup connect [resource] [survival node-id] --discard-my-data 명령을 통해 SB 해결
SB가 해결되고 나면 연결이 되지 않은 Victim 노드 간의 연결을 복원합니다. drbdsetup connect [resource] [victim node-id]
- SB 를 해결하여 연결을 복원하는 도중 2차 SB가 발생될 경우도 있습니다. 이럴 경우 위와 동일한 절차로 SB를 해결합니다.

자동 복구

스플릿 브레인이 발생하면 관리자에게는 스플릿 브레인 수동 복구(Manual split brain recovery)가 권장되지만, 경우에 따라서는 그 과정을 자동화하는 것이 좋을 수도 있습니다.

bsr은 스플릿 브레인을 자동 복구할 수 있는 몇 가지 알고리즘을 제공합니다.

Younger Primary 노드의 수정분 폐기: 네트워크 연결이 다시 복구 되고 스플릿 브레인이 감지되면, 최근 Primary 역할로 전환한 노드의 수정분을 폐기합니다.
Older Primary 노드의 수정분 폐기: 네트워크 연결이 다시 복구 되고 스플릿 브레인이 감지되면, 처음 Primary 역할을 가졌던 노드의 수정분을 폐기합니다.
수정 내용이 적은 Primary 데이터 폐기 - 양쪽 노드에서 수정 내용이 더 적은 쪽 노드를 확인하여 그 노드의 내용을 폐기합니다.
데이터 변경 사항이 없는 호스트로 복구 - 스플릿 브레인 동안 데이터의 수정 이력이 없는 노드가 있다면 해당 노드로 복구시키고 스플릿 브레인 해결을 선언합니다. 하지만 이것은 아주 드문 경우입니다. 양쪽 노드에서 리소스 볼륨을 파일시스템에 마운트(심지어 읽기전용)만 하더라도 볼륨 내용은 수정사항이 발생하기 쉬우며, 그 후에는 이 방식으로 자동 복구될 가능성은 없다고 봐야 합니다.

자동 스플릿 브레인 복구를 사용할 지 여부는 대체로 응용프로그램에 따라 달라집니다. 예를 들어 bsr에서 데이터베이스를 호스팅하는 경우를 생각해보면 사용자 인터페이스와 연관된 데이터 베이스를 사용하는 웹 응용프로그램에서는 폐기 해야할 변경사항이 거의 없어서 자동 복구가 괜찮은 수단이 될 수 있지만 반대로 금융 데이터 처럼 그 어떠한 데이터라도 함부로 폐기가 힘든 성격을 가지고 있는 환경에서는 사람이 직접 수동 복구를 해야 할 것입니다. 자동 스플릿 브레인 복구를 활성화하기 전에 응용프로그램의 요구사항을 신중하게 고려해야 합니다.

자동 복구 정책

자동 split-brain 복구 정책을 구성하기 위해서는 bsr이 제공하는 몇 가지 구성 옵션을 이해해야 합니다. bsr은 SB 감지 시 Primary로 있었던 노드의 수에 따라 복구 정책을 구분하고, 이에 대한 자동 복구 정책은 리소스의 <net> 섹션에 다음과 같은 키워드를 사용하여 구성합니다.

after-sb-0pri. SB가 발생된 시점에, 어떤 노드도 Primary 역할에 있지 않은 상황으로 복구 방법은 다음과 같습니다.

disconnect : 자동으로 복구하지 않습니다. 구성된 SB 핸들러 스크립트가 있다면 이를 호출하고, disconnected 모드를 지속합니다.
discard-younger-primary : 마지막으로 Primary 역할을 했다고 추정되는 노드에서 수정사항을 취소하고 되돌립니다. younger primary를 판단할 수 없다면 discard-zero-changes, discard-least-changes 순서로 동작하게 됩니다.
discard-least-changes : 최소의 변경 사항이 발생한 노드에서 수정사항을 취소하고 되돌립니다.
discard-zero-changes : 변경 사항이 발생하지 않은 노드가 있다면, 다른 노드의 변경 사항을 적용합니다.

after-sb-1pri. SB가 발생된 시점에, 하나의 노드가 Primary 역할에 있는 상황으로 복구 방법은 다음과 같습니다.

disconnect : 자동으로 복구하지 않습니다. 구성된 SB 핸들러 스크립트가 있다면 이를 호출하고, disconnected 모드를 지속합니다.
consensus : SB victim이 선택될 수 있다면 자동으로 해결합니다. 그렇지 않으면, disconnect처럼 동작합니다.
call-pri-lost-after-sb : SB victim이 선택될 수 있다면, victim 노드에서 pri-lost-after-sb 핸들러를 호출합니다. 이 핸들러는 <handlers> 섹션에 구성되어야 하고, 클러스터에서 노드가 강제로 제거됩니다.
discard-secondary : 현재 Secondary 역할에 있는 노드를 SB victim으로 만듭니다.

after-sb-2pri. SB가 발견된 시점에 양 노드에서 Primary 역할에 있는 상황으로, 복구 방법은 disconnect 를 통한 수동 복구만 사용할 수 있습니다.

구성 파일 작성 예시는 다음과 같습니다.

resource <resource> {
    handlers {
        split-brain "C:/Tools/script/split-brain.bat";
        ...
    }
    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        ...
    }
    ...
}

SB 핸들러의 배치 파일 구성에서 사용하는 파일들은 절대 경로로 기술되어야 합니다.