...

  • passthrough The default value of on-io-error. I/O errors are reported to the upper layer. On the primary node, the I/O error is passed to the mounted file system. On the secondary node, the error result is sent to the primary node, or ignored if there is no replication connection. The disk status is maintained and the failed block is recorded as out-of-sync (OOS).

    • max-passthrough-count Passthroughs cannot be allowed to repeat indefinitely. If passthroughs repeat more than a certain number of times, the condition should be treated as a permanent disk failure. This option specifies that threshold.

  • detach When a lower-level I/O error occurs, the node detaches the backing device from the replication volume and switches to the diskless state.

  • call-local-io-error Calls the command defined as the local I/O error handler. This option is available only if local-io-error <cmd> is defined in the <handlers> section of the resource. With the command or script invoked by local-io-error, I/O error handling is left entirely to the user's discretion.

The error handling policy can be applied in real time through the adjust command even if the resource is in operation.
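
As a rough illustration of the options above, the policy might be set in the resource configuration as in the following sketch. The placement of on-io-error in the disk section is an assumption carried over from DRBD-style configuration, and the handler script path is hypothetical. Check both against the BSR configuration reference before use. max-passthrough-count is left out because the text above does not show where it is configured.

```
resource <resource> {
    handlers {
        # Needed only for the call-local-io-error policy.
        # The script path is hypothetical.
        local-io-error "C:/Tools/script/local-io-error.bat";
    }
    disk {
        # Assumed placement: disk section, as in DRBD-style configurations.
        # One of: passthrough (default), detach, call-local-io-error.
        on-io-error passthrough;
    }
}
```

After the file is edited, the change would be applied to a running resource with the adjust command, for example bsradm adjust <resource>.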

Info

Disk I/O errors happen more often than you might think. Because BSR replication depends on the lower disk layer, and errors in the SCSI layer can occur at any time, the replication side must also be able to cope flexibly with these disk I/O errors.

Features of passthrough policy

BSR must be able to respond flexibly to errors that occur in the lower SCSI storage layer.

The detach policy, which was previously provided as the disk failure policy, unilaterally stops replication at a specific point in time. It is difficult to recover from after the fact and is also disadvantageous in terms of continuing service operation. We devised the passthrough policy in response to these issues and set it as the default policy for BSR.

When an I/O error occurs, the passthrough policy records OOS for the affected block and delivers the failed I/O result to the file system. If the file system then resolves the OOS by rewriting the block where the error occurred, the file system overcomes the temporary disk-layer error on its own. Even if some OOS cannot be completely resolved because of the operating characteristics of the file system, the remaining OOS can be resolved by resynchronization, for example through connection retries.

The passthrough policy guides the file system to resolve error blocks on its own or through synchronization, and basically ensures that service operation continues even if there are disk I/O problems.

Temporary error handling

If the I/O error policy is set to passthrough, the result of the I/O error is passed to the file system, and BSR records the failed I/O block as OOS. If the I/O error was temporary, the file system retries the write and the OOS is resolved on its own. Otherwise, the OOS area is resolved by inducing resynchronization through a replication reconnection. In that case, decisions such as the direction of synchronization used to resolve the OOS are left to the administrator or the HA operation logic.

...

It is possible to detach a resource automatically by using the detach policy, but the passthrough policy is recommended instead. The passthrough policy is more reasonable in terms of service operation because it can actively respond to temporary disk failures.

Info

BSR cannot determine whether an I/O error is temporary or permanent, and it does not collect detailed error statistics such as sector-by-sector I/O error tracking. Instead, it maintains only a count of I/O errors for the device on which the errors occurred. Although detailed information is not shown, the severity of the problem can be judged from the number of I/O errors shown in bsradm status. If the error is temporary, the io error and oos values shown in the status output will be small and will not increase over time.

...

If such a permanent error occurs, the administrator must reconfigure replication or replace the disk. To fix a permanent disk failure, first bring the resource down and then detach the disk from replication. You must also reinitialize the metadata for any replication resource that needs to be reconfigured.

Disk replacement

Regenerate the metadata set and reconnect the resource as follows. If necessary, perform a full synchronization by explicitly running the synchronization command.

...

The following is an example of a configuration file with split-brain (SB) auto recovery.

Code Block
resource <resource> {
    handlers {
        split-brain "C:/Tools/script/split-brain.bat";
        ...
    }
    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        ...
    }
    ...
}

Info

Files referenced by the SB handler in the configuration file must be specified as absolute paths.

Compatibility

This section describes compatibility issues that arise when BSR is used together with third-party products. Most compatibility issues in BSR appear as errors that occur because the functionality or behavior of another program or module affects the operation of BSR.

Volume references

BSR maintains a reference count for each volume internally during resource operations. The reference count manages the lifecycle of a BSR volume: it is incremented when a reference is acquired and decremented when a reference is released.

...

Device reference errors

BSR internally maintains a reference count for backing devices. The count is incremented when a process opens the device and decremented when the process closes it. BSR uses the count to manage the lifecycle of the device and to ensure that no process still references the device volume when it is time to clean up its resources.

Problems arise when the reference count is non-zero, either when configuration of the BSR resource begins or while the device is being unmounted. Because some process opened the device and never closed it, BSR detects this and treats it as an error.

  • Resource configuration errors

If any process is referencing the backing device of the BSR resource, attaching the device fails with the following error. The error code 16 corresponds to EBUSY.

Info

[open_backing_dev] [DRIVER:140] bsr_erro<3> bsr bsr0/0 bsr0: Failed to open("/dev/sdb1") backing device with -16

  • Down errors

If the number of references to the device is not zero when the resource is brought down, the down operation fails and the following log is output.

Info

bsr volume(r0) Secondary failed (error=0: State change failed: (-12) Device is held open by someone

To resolve this issue, you need to identify the process or module that is referencing the BSR volume and take steps to reduce the number of references to the volume, such as forcing that module to stop (fuser -ck).

For example, on Ubuntu 20.04 and later, multipath-tools (0.8.3) must have the following exception handling in multipath.conf to prevent it from referencing the bsr device. Otherwise, multipath-tools will continue to reference the bsr volume, making it impossible to bring the resource down.

Code Block
blacklist {
        devnode "^(sd)[a-z]"
        devnode "^(bsr)[0-9]"
}