...

  • passthrough The default value of on-io-error. I/O errors are reported to the upper layer. On the primary node, the error is passed to the mounted file system. On a secondary node, the error result is sent to the primary node, or ignored if there is no replication connection. The disk state is maintained, and the failed block is recorded as out-of-sync (OOS).

    • max-passthrough-count Passthroughs cannot be repeated indefinitely. If they recur more than a certain number of times, the condition should be treated as a permanent disk failure; this option specifies that threshold.

  • detach When a lower-level I/O error occurs, the node detaches the backing device from the replication volume and switches to the diskless state.

  • call-local-io-error Calls the command defined as the local I/O error handler. This option is only available if local-io-error <cmd> is defined in the <handlers> section of the resource. With the command or script invoked by local-io-error, I/O error handling is left entirely to the user's discretion.

The error handling policy can be applied at runtime with the adjust command, even while the resource is in operation.
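
As a rough sketch, the policies listed above might be configured as follows. It assumes that on-io-error and max-passthrough-count belong in the disk section of the resource, as in DRBD-style configuration files; the threshold value and the handler script path are hypothetical placeholders, not recommended settings.

```
resource <resource> {
    disk {
        # Default policy: report the I/O error to the upper layer and record the failed block as OOS.
        on-io-error passthrough;
        # Assumed placement, illustrative value: beyond this many passthroughs, treat the disk as permanently failed.
        max-passthrough-count 100;
    }
    handlers {
        # Only used when on-io-error is set to call-local-io-error. Hypothetical script path.
        local-io-error "C:/Tools/script/local-io-error.bat";
    }
}
```

After editing the file, the change would be applied with the adjust command, for example bsradm adjust <resource>, assuming the usual bsradm command form.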

Disk I/O errors happen more often than you might think. Because BSR replication depends on the lower disk layer, and errors in the SCSI layer can occur at any time, it must be able to cope flexibly with disk I/O errors on the replication side as well.
Info

Features of passthrough policy

bsr must be able to flexibly respond to errors occurring in the lower SCSI storage layer.

The detach policy, which has been provided as a disk failure policy, unilaterally stops replication at a specific point in time. It is difficult to recover from after the fact and is disadvantageous for continued service operation. We devised the passthrough policy in response to these issues and set it as the default policy for BSR.

When an I/O error occurs, the passthrough policy records an OOS for the affected block and forwards the failed I/O result to the file system. If the file system then rewrites the failed block and clears the OOS, the file system overcomes the transient disk-layer error on its own. Even if the file system's behavior leaves some OOS unresolved, the remaining OOS can be resolved by resynchronization, for example by retrying the connection.

In other words, the passthrough policy guides the file system to resolve error blocks on its own or through synchronization, and essentially ensures that the service continues to operate even when there are disk I/O problems.

Temporary error handling

If the I/O error policy is set to passthrough, the result of the I/O error is passed to the file system, and bsr records the failed I/O block as OOS. If the I/O error was temporary, the file system retries the block and the OOS is resolved on its own. Otherwise, the OOS area is resolved by inducing resynchronization through a replication reconnection. In that case, managing operations such as the direction of synchronization is left to the administrator or the HA operation logic.

Resources can be detached automatically by using the detach policy, but the passthrough policy is recommended instead. It is more reasonable from a service operation standpoint because it can actively respond to temporary disk failures.

Info

bsr cannot determine whether an I/O error is temporary or permanent, and it does not collect detailed error statistics such as sector-by-sector I/O error tracking. Instead, it maintains only the number of I/O errors for the device on which they occurred. Although no detailed information is shown, the severity of the problem can be judged from the I/O error count in bsradm status. If the error is temporary, the io error and oos values shown in status stay small and do not increase over time.

...

If such a permanent error occurs, the administrator must reconfigure replication or replace the disk. To fix a permanent disk failure, first bring the resource down and then detach the disk from replication. You must also reinitialize the meta disk for any replication resource that needs reconfiguration.

Disk replacement

Regenerate the metadata set and reconnect the resources as follows. If necessary, run the synchronization command explicitly to perform a full synchronization.

...

The following is an example configuration file with split-brain (SB) automatic recovery.

Code Block
resource <resource> {
    handlers {
        split-brain "C:/Tools/script/split-brain.bat";
        ...
    }
    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        ...
    }
    ...
}
Info

Files that the SB handler references in the configuration file must be specified as absolute paths.

...

BSR internally maintains a reference count for each backing device. The count is incremented when a process opens the device and decremented when the process closes it. BSR uses the count to manage the device's lifecycle and to enforce that no process still references the device volume when it is time to clean up its resources.

Problems arise when the reference count is non-zero, either when configuration of the BSR resource begins or while the device is being unmounted. A non-zero count means that some process opened the device and never closed it; BSR detects this and treats it as an error.

...