Disk Error

A disk failure (error) is a situation in which disk I/O fails unexpectedly, for example because of a disconnected physical connector at the disk storage layer, media corruption, a bad sector, or a SCSI error. Such failures may occur temporarily and then clear on their own, or they may turn out to be permanent. bsr classifies these failures as temporary or permanent and handles each type differently.

A temporary failure is a situation where, for some reason at the storage layer, errors occur briefly and then things return to normal. Since this is not a serious situation that requires a disk replacement, it is most efficient to keep the service running as long as possible, resolve the errors that occurred, and keep replication going. In other words, in a temporary error situation the block area where the I/O error occurred is recorded as out-of-sync (OOS), and if a retried I/O on that block succeeds, the OOS is resolved naturally.

A permanent disk failure requires physical recovery, such as a disk replacement, and must be repaired through a full reconfiguration (rebuild) procedure.

The I/O error handling policy is set with the on-io-error option in the resource's <disk> section.

resource <resource> {
  disk {
    on-io-error <strategy>;
    ...
  }
  ...
}

If the option is defined in the <common> section, it applies to all resources.
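
For example, a sketch of setting the policy once for every resource, assuming the <common> section accepts the same <disk> options shown above:

common {
  # assumes the <common> section accepts the same disk options as an individual resource
  disk {
    on-io-error passthrough;
    ...
  }
}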

<strategy> has three options:

  • passthrough : The default value of on-io-error. I/O errors are reported to the upper layer: on the primary node the error is passed to the mounted file system, and on the secondary node the error result is returned to the primary node (or ignored if there is no replication connection). The disk state is maintained and the failing block is recorded as OOS.

  • detach : When a lower-level I/O error occurs, the node detaches the backing device from the replication volume and switches to the diskless state.

  • call-local-io-error : Calls the command defined as the local I/O error handler. This option is only available if local-io-error <cmd> is defined in the <handlers> section of the resource. With the local-io-error command or script, I/O error handling is entirely at the user's discretion.

The error handling policy can be changed at runtime through the adjust command, even while the resource is operating.
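
For example, after editing on-io-error in the configuration file, the new policy can be applied to the running resource like this (the resource name is a placeholder, as in the examples above):

bsradm adjust <resource>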

In our experience operating replication services, disk failures occur more often than expected, and how often depends on the underlying disk layer. Errors in the disk layer, that is, the standard SCSI layer, can occur at any time, so they must be handled flexibly from the replication side, independently of how stable the disk layer itself is.

The detach policy, which was traditionally provided as the disk failure policy, unilaterally stops replication at the moment the error occurs. From a service operation point of view this makes recovery afterwards difficult and works against keeping the service running. We devised the passthrough policy to solve this problem and made it the default policy for bsr.

The passthrough policy records OOS for the block when an I/O error occurs and delivers the failed I/O result to the file system. If the file system then succeeds in retrying the write to that block, the OOS is cleared and the temporary disk layer error has been overcome. Depending on the behavior of the file system, the OOS may not be completely resolved this way, but any remaining OOS can be resolved by resynchronization through a connection retry. In other words, the passthrough policy induces the file system to resolve the error block by itself or through synchronization, and ensures that the service keeps operating even when disk I/O has a problem.

Temporary error handling

If the I/O error policy is set to passthrough, the result of a failed I/O is passed up to the file system and bsr records the failing block as OOS. If the I/O error was temporary, the file system's retry resolves the OOS by itself. Otherwise, the OOS area is resolved by inducing resynchronization through a replication reconnect. Operational decisions such as the direction of synchronization used to resolve the OOS are left to the administrator or the HA operation logic.

bsr cannot determine whether an I/O error is temporary or permanent, and it does not collect detailed error statistics such as sector-by-sector I/O error tracking. It only maintains a count of I/O errors for the device on which they occurred. Although no detailed information is shown, the severity of the error can be judged from the I/O error count in bsradm status. For a temporary error, the io error count and OOS shown in status stay small and do not grow over time.
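
For example, the error count can be checked at any time with the status command mentioned above (a sketch; the exact fields shown may vary by version):

bsradm status <resource>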

Permanent error handling

All disk failure situations are treated as permanent errors, except those in which some OOS is recorded due to temporary I/O errors.

Permanent errors can occur in a variety of situations: for example, physical damage to the disk (bad blocks), a disconnected storage cable, removal of the volume, or a failed SCSI controller. This includes situations unrelated to replication, such as the volume being removed by an administrator's mistake or by another program that is not compatible with bsr.

When a permanent error occurs, the administrator must reconfigure replication or replace the disk. To fix a permanent disk failure, first bring the resource down and then detach the disk from replication. Of course, the meta disk must be reinitialized for any replication resource that is reconfigured.

The detach policy can detach the resource automatically, but setting the passthrough policy is recommended instead: it is more reasonable in terms of service operation because it can actively respond to temporary disk failures.

Disk replacement

After replacing the disk, regenerate the meta data set and bring the resource back up as follows. If necessary, force a full synchronization by explicitly running the invalidate command.

C:\Program Files\bsr\bin>bsradm down <resource>
  
C:\Program Files\bsr\bin>bsradm create-md <resource>
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initializing bitmap
New bsr meta data block successfully created.
 
 
C:\Program Files\bsr\bin>bsradm up <resource>
C:\Program Files\bsr\bin>bsradm invalidate <resource>

Etc

For disk read errors and meta disk I/O errors, the automatic detach policy is applied and replication stops; a resource stopped this way can be recovered only through reconfiguration.

Node failure

If a node goes down due to hardware failure or manual intervention, the peer node detects this, switches the connection state to Connecting, and waits in disconnected mode until the failed node reappears. Node failures are usually temporary and recover by themselves, but a permanent failure must be recovered manually.

Secondary failure

If the target side goes down, the source side can continue operating in disconnected mode, but block modifications are no longer replicated to the peer. Information about the changed blocks is kept internally, and those blocks are synchronized once the peer node is restored.

Primary failure

During replication, the primary node may stop unexpectedly, for example due to a power failure. This is called the Crashed Primary state. The peer node detects that the primary has disappeared and switches to disconnected mode. When the crashed primary node is rebooted, the service starts it in the secondary role; it is not automatically promoted to the primary role. The crashed primary then establishes a connection with the secondary node, and in the process both nodes recognize that a crashed primary is involved. After that, the consistency of both nodes is restored through synchronization. In this case no manual intervention is required when the failed node recovers, but the cluster management application (HA) can decide which node to promote and continue operating the service.

If the source side goes down, HA must first decide whether to fail over. That decision determines which node acts as the synchronization source when the failed node is later restored and rejoins the cluster.

  • If it fails over, the node that was the standby becomes the new source node.

  • If it waits without failing over, the node that was the original source acts as the synchronization source when it returns.

In the case of a primary node failure, a special mechanism based on the activity log (AL) restores the consistency of the block device. For details, refer to the activity log section of BSR Internals.

Permanent failure

In the event of a permanent or non-recoverable failure, you must perform the following steps.

  • Replace failed hardware with hardware of similar performance and disk capacity.

Note that if the target is replaced with a disk with a smaller capacity than the existing one, the replication connection is rejected.

  • Install basic systems and applications.

  • Install bsr and copy /etc/bsr.conf and the files under /etc/bsr.d/ from the running node (see the sketch below).

  • Reconfigure according to the resource reconfiguration procedure, start the resource, and begin the initial synchronization.

If the primary is already running, synchronization starts automatically when the connection is established, so there is no need to start it manually.
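
A sketch of the configuration copy step on Linux, assuming SSH access to the running node and that its hostname is survivor-node (the hostname is hypothetical; the paths are the defaults mentioned above):

# 'survivor-node' is a hypothetical hostname; the paths are the defaults noted above
scp survivor-node:/etc/bsr.conf /etc/
scp -r survivor-node:/etc/bsr.d /etc/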

Split-brain

Split-brain behavior settings

bsr provides a way to automatically notify the operator when a split brain is detected.

Split-brain notification

To be notified when a split brain occurs, configure a split-brain handler. The following is an example resource configuration.

resource <resource> {
  handlers {
    split-brain <handler>;
    ...
  }
  ...
}

<handler> is an executable module on the system. It is usually recommended to configure it in an executable form such as a batch file or shell script.
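
For example, a minimal sketch of a Windows batch handler that only records the event and returns immediately (the log file path is hypothetical; see the notice below):

@echo off
rem Hypothetical notification script: append a timestamped line and exit right away
echo [%date% %time%] bsr split-brain detected >> C:\Tools\script\split-brain.log
exit /b 0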

notice

Because the bsr kernel engine waits synchronously for the split-brain handler to return, a handler that does not return its result can affect the kernel engine. We therefore recommend keeping the handler a simple, uncomplicated script.

On Linux the split-brain handler is enabled by default, but on Windows it is disabled by default. You can activate handler support with the following command:

drbdcon /handler_use 1

Manual recovery

If bsr detects at reconnection time that both nodes were in the primary role while the replication connection was down, it declares a split brain and immediately drops the replication connection, leaving the following message in the log:

Split-Brain detected, dropping connection! 

A node that detects the split brain puts the resource in the StandAlone state. If both nodes detect the split brain at the same time, both will be in the StandAlone state; but if one node disconnects first, before the other node has detected the split brain, the other node does not detect the SB and remains in the Connecting state.

To recover from a split brain when automatic recovery is not configured, first choose the node whose changes will be discarded (the victim) and then recover manually with the following procedure.

First, put both nodes into the StandAlone state with the following command.

bsradm disconnect <resource>

Then execute the following commands on the victim node, whose changes will be discarded.

bsradm secondary <resource>
bsradm connect --discard-my-data <resource>

Connect from the survivor node. If it was already in the Connecting state, you can skip this step; it will connect automatically.

bsradm connect <resource>

Upon successful connection, the victim's replication state immediately changes to SyncTarget, and the data differences between the nodes are synchronized from the survivor node.

The victim's data is not resynchronized across the entire device. Instead, the victim's modifications are rolled back and the survivor's changes are propagated to the victim node; in other words, only the changed blocks are synchronized.

When resynchronization completes, the split brain is considered resolved and the two nodes once again form a fully consistent, redundant replication system.

Multiple split-brain

In a 1:N replication environment, multiple split brains can occur between nodes.

For example, in an A-B-C 1:2 replication configuration, suppose every node is disconnected, every resource is promoted, and I/O is performed on each volume. If each node is then demoted and reconnected, a split brain occurs between each pair of nodes: A-B, B-C, and A-C. Multiple SBs have occurred within the cluster. In this case they must be resolved sequentially, per connection between nodes, using the following procedure.

  • Decide which node is the survivor and which nodes are the victims, then disconnect on all nodes so that they are in the StandAlone state.

  • Resolve the SB between the nodes where it occurred (see the sketch after this list).

    • The survivor node enters the Connecting state with the command drbdsetup connect [resource] [victim node-id]

    • The victim node resolves the SB with the command drbdsetup connect [resource] [survivor node-id] --discard-my-data

  • After the SB is resolved, restore the connections to the remaining unconnected victim nodes with drbdsetup connect [resource] [victim node-id]

    • In some cases a secondary SB may occur while restoring these connections. Resolve it with the same procedure as above.
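
A sketch of the A-B-C case above, assuming node A is the survivor, B and C are the victims, the resource name is r0, and the node-ids of A, B and C are 0, 1 and 2 (all of these names and ids are hypothetical):

On every node, force the StandAlone state:
bsradm disconnect r0

On A (survivor), toward victim B:
drbdsetup connect r0 1

On B (victim), toward survivor A:
drbdsetup connect r0 0 --discard-my-data

Repeat the same pair of commands for A and C (node-id 2), then restore the remaining B-C connection with drbdsetup connect on both nodes, resolving any secondary SB with the same procedure.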

Automatic recovery

When a split brain occurs, manual recovery by an administrator is recommended, but in some cases it can be useful to automate the process.

bsr provides several algorithms to automatically repair split brain.

  • Discard the modifications of the younger primary: when the network connection is restored and a split brain is detected, the modifications of the node that switched to the primary role more recently are discarded.

  • Discard the modifications of the older primary: when the network connection is restored and a split brain is detected, the modifications of the node that switched to the primary role earlier are discarded.

  • Discard the data with fewer modifications: both nodes determine which node made fewer modifications, and that node's changes are discarded.

  • Recover from the host with no data changes: if one node made no data modifications during the split brain, recovery is based on that node and the split brain is declared resolved. This is a very rare case, however; if the resource volume was mounted on the file system of both nodes (even read-only), the volume contents are likely to have been modified, which rules out this form of automatic recovery.

Whether to use automatic split brain recovery is largely application-specific. For example, if you host a database on bsr, automatic recovery can be acceptable for a web application whose database backs the user interface and whose discarded changes would be few; conversely, in an environment where changes cannot simply be thrown away, a person must perform the recovery manually. Before enabling automatic split brain recovery, carefully consider your application's requirements.

Automatic recovery policy

To configure the automatic split-brain recovery policy, you need to understand some of the configuration options that bsr provides. bsr classifies recovery policies according to the number of nodes that were in the primary role when the SB was detected, and the automatic recovery policy is configured with the following keywords in the <net> section of the resource.

after-sb-0pri. Applies when no node was in the primary role at the time of the SB. The recovery options are:

  • disconnect : It does not recover automatically. If there is a configured SB handler script, it is called, and it remains in disconnected mode.

  • discard-younger-primary : Discard and roll back the modifications of the node that took on the primary role most recently. If the younger primary cannot be determined, the discard-zero-changes and discard-least-changes policies are tried, in that order.

  • discard-least-changes : Discard and roll back the modifications of the node with the fewest changes.

  • discard-zero-changes : If there is a node that made no changes, the changes of the other node are applied to it.

after-sb-1pri. Applies when exactly one node was in the primary role at the time of the SB. The recovery options are:

  • disconnect : It does not recover automatically. If there is a configured SB handler script, it is called, and it remains in disconnected mode.

  • consensus : If an SB victim can be selected, the SB is resolved automatically; otherwise it behaves like disconnect.

  • call-pri-lost-after-sb : If an SB victim can be selected, the victim node calls the pri-lost-after-sb handler. The handler must be configured in the <handlers> section, and the node is forcibly removed from the cluster.

  • discard-secondary : Make the node currently in the Secondary role a victim of SB.

after-sb-2pri. Applies when both nodes were in the primary role at the time the SB was detected. The only available option is disconnect, which means manual recovery.

The following is an example of a configuration file with SB auto recovery.

resource <resource> {
    handlers {
        split-brain "C:/Tools/script/split-brain.bat";
        ...
    }
    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        ...
    }
    ...
}

Handler files referenced in the SB handler configuration must be specified with absolute paths.

Compatibility

This section describes compatibility issues that arise when working with third-party products. Most compatibility issues appear as BSR errors because the functionality or behavior of another program or module interferes with BSR's operation.

Volume references

BSR maintains reference counts on its volumes internally during resource operation. The volume reference count manages the lifecycle of a BSR volume: it is incremented when a reference is acquired and decremented when the reference is released.

If a volume's reference count is non-zero when the resource is brought down, the down fails with an error log like the following:

bsr volume(r0) Secondary failed (error=0: State change failed: (-12) Device is held open by someone

To resolve this issue, identify the process or module that is holding a reference to the BSR volume and take steps to release that reference, for example by disabling the behavior of that module.
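
For example, on Linux a general-purpose tool such as lsof can show which processes hold the volume open. A sketch, assuming the bsr device node is /dev/bsr0 (the actual device name may differ):

# /dev/bsr0 is an assumed device name; substitute the actual bsr volume node
lsof /dev/bsr0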

For example, on Ubuntu 20.04 and later, multipath-tools (0.8.3) must be prevented from referencing the bsr volume by adding the following exception to multipath.conf. Otherwise multipath-tools will keep a reference to the bsr volume, making it impossible to bring the resource down.

blacklist {
        devnode "^(sd)[a-z]"
        devnode "^(bsr)[0-9]"
}
