bsr (Block Sync & Replication) is a solution that replicates host block devices in real time, in software, over the network.
Based on Windows DRBD (wdrbd), which was forked from drbd (http://www.drbd.org), it is an open source project built as a cross-platform common engine that supports both the Windows and Linux platforms. It inherits all the basic concepts and functions of Windows DRBD, and provides a more stable and efficient replication environment by addressing the problems and shortcomings of DRBD9.
bsr is open to contributions and participation in development through the open source community (bsr). For technical support or inquiries regarding bsr, please use the GitHub issues for bsr or contact Mantech at bsr@mantech.co.kr.
bsr is distributed under the GPL v2 license.
Basic
bsr replicates in the following way:
The application writes data to the block device, and bsr replicates the written data in real time.
Real-time replication does not affect other application services or system elements.
Replication is performed synchronously or asynchronously:
The synchronous protocol treats replication as complete when the data has been written to both the local disk and the target host's disk.
The asynchronous protocol treats replication as complete when the data has been written to the local disk and transferred to the socket's TX buffer.
Kernel Driver
The core engine of bsr was developed as a kernel driver.
The kernel driver is located at the disk volume layer and controls write I/O from the file system in block units. Because it performs replication in a layer below the file system, it provides a transparent replication environment independent of the file system and the application, which makes it well suited for high availability. However, since it sits below the file system, it cannot control general file-level operations. For example, it cannot detect damage to the file system or manipulate individual file data; it only replicates the blocks written to disk.
bsr provides Active-Passive clustering by default and does not support Active-Active clustering.
Management tools
bsr provides management tools for organizing and managing resources. It consists of bsradm, bsrsetup, bsrmeta, and bsrcon described below. Administrator-level privileges are required to use administrative commands.
bsradm
bsradm is a utility that provides high-level commands abstracting the detailed functions of bsr. It allows you to control most of the behavior of bsr.
bsradm gets all configuration parameters from the configuration file etc/bsr.conf, and passes commands to bsrsetup and bsrmeta with the appropriate options. That is, the actual operations are performed by bsrsetup and bsrmeta.
bsradm can be run in dry-run mode via the -d option. This provides a way to see in advance which bsrsetup and bsrmeta commands bsradm would run, and with which combination of options, without actually invoking them.
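For example, a dry run might look like the following sketch. The resource name r0 is a placeholder, and the up sub-command is assumed from bsr's DRBD heritage:

```
# Print the bsrsetup/bsrmeta commands that "bsradm up" would execute,
# without actually running them ("r0" is a placeholder resource name,
# and the "up" sub-command is assumed from bsr's DRBD heritage)
bsradm -d up r0
```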
For more information about the bsradm command options, refer to bsradm in Appendix B. System Manual.
bsrsetup
bsrsetup can set option values for the bsr kernel engine. All parameters to bsrsetup must be passed as text arguments.
The separation of bsradm and bsrsetup provides a flexible command system.
bsradm translates the parameters it accepts into the more complex parameters expected by bsrsetup, and then invokes bsrsetup.
bsradm prevents user errors by checking resource configuration files for syntax errors and the like; bsrsetup does not perform such checks.
In most cases, it is not necessary to use bsrsetup directly; it is used to perform individual functions or individual control between nodes.
For details on the bsrsetup command options, refer to bsrsetup in Appendix B. System Manual.
bsrmeta
bsrmeta creates the metadata files for a replication configuration, and provides dump, restore, and modification capabilities for the metadata. In most cases, as with bsrsetup, you do not need to use bsrmeta directly; the metadata is controlled through the commands provided by bsradm.
For details on the bsrmeta command options, refer to bsrmeta in Appendix B. System Manual.
bsrcon
bsrcon checks bsr-related information and adjusts other necessary settings.
For more information about the bsrcon command options, refer to bsrcon in Appendix B. System Manual.
Resource
Resources are the abstraction of everything needed to construct a replicated data set. Users configure a resource and then control it to operate a replication environment.
In order to configure resources, the following basic options (resource name, volume, network connection) must be specified.
Resource Name
The resource name must be given in US-ASCII and must not contain spaces.
Volume
A resource is a replication group consisting of one or more volumes that share a common replication stream, ensuring write consistency of all volumes in the resource.
A volume is described as a single device, and is designated by a drive letter on Windows.
One replication set requires one volume for data replication and a separate volume for storing metadata associated with the volume. Meta volumes are used to store and manage internal information for replication.
Metadata is classified as internal meta type or external meta type according to its storage location: if the metadata is located on the same disk as the replication volume, it is internal meta; if it is located on another device or disk, it is external meta.
In terms of performance, the external meta type has an advantage over the internal meta type, because bsr can then perform replication I/O and metadata writes simultaneously during operation. Since the I/O performance of the meta disk directly affects replication performance, it is recommended to use as high-performance a disk as possible.
Note that the meta volume must be configured as RAW, without being formatted with a file system (such as NTFS).
Network Connection
Connection is a communication link for a replication data set between two hosts.
Each resource is defined across multiple hosts, including full-mesh connections between those hosts.
The connection name is automatically assigned as the resource name at the bsradm level, unless otherwise specified.
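Putting the three basic options together, a minimal resource definition might look like the following sketch. The syntax follows the DRBD-style configuration that bsr inherits; the node names, drive letters, addresses, and port are placeholders, and the exact keywords should be checked against the bsr manual:

```
# Hypothetical resource definition (DRBD-style syntax, placeholder values)
resource r0 {
    on node-a {
        disk        d;               # replication volume (drive letter on Windows)
        meta-disk   f;               # separate RAW volume for external meta
        address     10.10.0.1:7789;  # replication network endpoint
    }
    on node-b {
        disk        d;
        meta-disk   f;
        address     10.10.0.2:7789;
    }
}
```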
Resource Role
Resources have a role of either Primary or Secondary.
The Primary can read and write to the resource without restriction.
The Secondary receives changes to the disk from the other node and records them, and does not allow access to the volume; therefore, applications cannot read from or write to a Secondary volume.
The role of a resource can be changed with the bsr utility commands. Changing a resource's role from Secondary to Primary is called promotion; the opposite is called demotion.
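As a sketch, promotion and demotion would be performed with commands like the following; the primary/secondary sub-command names are assumed from bsr's DRBD heritage, and r0 is a placeholder resource name:

```
# Promote resource "r0" to Primary on this node (assumed sub-command)
bsradm primary r0
# Demote it back to Secondary (assumed sub-command)
bsradm secondary r0
```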
Main Features
Replication Cluster
bsr defines the set of nodes participating in replication as a replication cluster, and by default supports single primary mode, in which a resource can act as Primary on only one member node of the replication cluster. Dual or multiple primary modes are not supported. Single primary mode, the Active-Passive model, is the standard approach to handling data storage media in a highly available cluster for failover.
Replication Protocol
bsr supports three replication methods.
Protocol A. Asynchronous
The asynchronous method considers replication complete when the primary node has finished writing to its local disk and has written the data to the TCP send buffer. Consequently, when a failover occurs, data still in the buffer may not have fully reached the standby node. The data on the standby node is consistent after a transfer, but some of the most recent updates written during the transfer may be lost. This method has good local I/O responsiveness and is suitable for WAN remote replication environments.
Protocol B. Semi Synchronous
In the semi-synchronous method, when a local disk write occurs on the primary node, replication is considered complete when the other node has received the replication packet.
Normally, data loss does not occur during a failover, but if both nodes are powered off simultaneously, or irreparable damage occurs to the primary storage, the most recently written data on the primary may be lost.
Protocol C. Synchronous
The synchronous method is considered complete when the primary node has completed writing to both the local and remote disks. This ensures that no data is lost in the event of loss on either node.
Of course, loss of data is inevitable if both nodes (or a node's storage subsystem) suffer irreparable damage at the same time.
In general, Protocol C is the most commonly used method with bsr.
The replication method should be chosen based on data consistency requirements, local I/O latency, and throughput.
Synchronous replication completely guarantees consistency between the active and standby nodes, but because each local write I/O completes only after the write to the standby node has also completed, there is a local I/O latency penalty. Depending on the I/O depth, latency can increase by several times up to tens of times or more; in terms of throughput, it averages about 70 MB/s on a 1 Gbps network.
For an example of configuring the replication mode, refer to create resources.
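A minimal sketch of such a configuration, assuming the DRBD-style syntax that bsr inherits (r0 is a placeholder resource name):

```
# Hypothetical: choosing the replication protocol for resource "r0"
resource r0 {
    protocol C;   # A = asynchronous, B = semi-synchronous, C = synchronous
    ...
}
```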
Replication Transport Protocol
bsr's replication transport network supports the TCP/IP transport protocol.
TCP(IPv4/v6)
It is the basic transport protocol of bsr and is a standard protocol that can be used on all systems that support IPv4/v6.
Fast Synchronization
bsr implements FastSync, which synchronizes only the areas of the volume that are actually in use. Without FastSync, the entire volume area must be synchronized, which takes a long time for large volumes. FastSync is a powerful feature of bsr that can save a great deal of synchronization time.
Efficient synchronization
In bsr, replication and (re)synchronization are separate concepts. Replication is the process of reflecting all disk writes to a Primary-role resource on a secondary node in real time; synchronization is the process of copying block data across the entire block device, excluding the real-time write I/O. Replication and synchronization work independently, but they can proceed simultaneously.
If the connection between the primary and secondary is maintained, replication continues. However, if the replication connection is interrupted due to a failure of the primary or secondary node, or the replication network is disconnected, synchronization between the primary and secondary is required.
When synchronizing, bsr does not synchronize blocks in the order in which the original I/O was written to disk. Instead, based on the information in the metadata, it sequentially synchronizes only the out-of-sync areas from sector 0 to the last sector, which is efficient for the following reasons:
Synchronization is performed block by block according to the block layout of the disk, so disk search is rarely performed.
It is efficient because it synchronizes only once for blocks in which multiple writes have been made in succession.
During synchronization, part of the standby node's dataset is out of date while part has been updated to the latest data. This state is called Inconsistent; the state in which all blocks have been synchronized with the latest data is called UpToDate. A node in the Inconsistent state generally cannot use its volume, so it is desirable to keep this state as short as possible.
Of course, even while synchronization runs in the background, the application service on the Active node can continue to operate without interruption.
Fixed-rate synchronization
In fixed-rate synchronization, the rate at which data is synchronized to the peer node per second (the synchronization rate) is adjusted within an upper limit, specified by the minimum (c-min-rate) and maximum (c-max-rate) values.
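A sketch of how these limits might be set, using the c-min-rate and c-max-rate parameter names given above; the disk-section placement and unit syntax follow DRBD conventions and should be verified against the bsr manual:

```
# Hypothetical fixed-rate synchronization limits for resource "r0"
resource r0 {
    disk {
        c-min-rate  10M;    # lower bound on the synchronization rate
        c-max-rate  100M;   # upper bound on the synchronization rate
    }
    ...
}
```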
Variable-rate synchronization
In variable-rate synchronization, bsr detects the available network bandwidth, compares it with the I/O received from the application, and automatically calculates an appropriate synchronization rate. bsr uses variable-rate synchronization by default.
Checksum-based synchronization
Checksum-based digests can further improve the efficiency of the synchronization algorithm. Checksum-based synchronization reads a block before synchronizing it, obtains a hash digest of what is currently on disk, then reads the same sector on the other node and compares the two digests. If the hash values match, the re-write of that block is skipped. This can perform better than simply overwriting every block that needs synchronization, and if the file system rewrote identical data to a sector while disconnected, resynchronization is skipped for that sector, shortening the overall synchronization time.
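Checksum-based synchronization would be enabled with a digest-algorithm setting such as the following sketch; the csums-alg keyword is assumed from bsr's DRBD heritage and should be verified against the bsr manual:

```
# Hypothetical: enable checksum-based synchronization for resource "r0"
resource r0 {
    net {
        csums-alg md5;   # hash used to compare blocks before rewriting them
    }
    ...
}
```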
Congestion Mode
bsr provides a congestion mode function that can actively operate by detecting the congestion of the replication network during asynchronous replication. The congestion mode provides three operation modes: Blocking, Disconnect, and Ahead.
If nothing is configured, blocking mode is the default. In blocking mode, bsr waits until there is enough space in the TX send buffer to transmit the replication data.
Disconnect mode temporarily relieves the local I/O load by disconnecting the replication connection.
Ahead mode maintains the replication connection while writing the primary node's I/O to the local disk first and recording the affected areas as out-of-sync; when the congestion is released, it resynchronizes automatically. A primary node in the Ahead state is ahead of the Secondary node in terms of data; the Secondary is in the behind state, but its data remains consistent and available. When the congestion state is released, replication to the Secondary automatically resumes, and background synchronization is performed for the out-of-sync blocks that were not replicated while in the Ahead state. Congestion mode is generally useful in network link environments with variable bandwidth, such as wide-area replication between data centers or cloud instances over shared links.
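A sketch of congestion-mode settings for an asynchronous resource; the on-congestion and congestion-fill keywords and their values are assumed from bsr's DRBD heritage (where Ahead mode is selected with pull-ahead) and should be verified against the bsr manual:

```
# Hypothetical congestion-mode configuration for resource "r0"
resource r0 {
    net {
        protocol A;                  # congestion mode applies to async replication
        on-congestion   pull-ahead;  # or: block (default), disconnect
        congestion-fill 100M;        # send-buffer fill level that triggers it
    }
    ...
}
```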
Online Verification
Online verification is a function that checks the integrity of block-level data between nodes during device operation. It uses network bandwidth efficiently and avoids redundant checks.
Online verification computes a cryptographic digest of every data block on a specific resource's storage on one node (the verification source), transmits the digests to the verification target, and compares them against the contents of the same block locations there. If a digest does not match, the block is marked out-of-sync and is synchronized later. Network bandwidth is used effectively because only the small digest is transmitted, not the entire block.
Since resource integrity is verified online, there may be a slight decrease in replication performance when online verification and replication run simultaneously. However, there is no need to stop the service, and there is no system downtime during the scan or the synchronization that follows it. And since bsr uses FastSync as its basic logic, it is more efficient still, performing the online scan only on the disk areas used by the file system.
It is common practice to run online verification as an OS-level scheduled task, periodically, during times of low operational I/O load. For more information on how to configure online integrity checking, see Using on-line device verification.
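As a sketch, such a scheduled job could invoke a command like the following; the verify sub-command is assumed from bsr's DRBD heritage, and r0 is a placeholder resource name:

```
# Start an online verification pass over resource "r0" (assumed sub-command);
# blocks whose digests differ are marked out-of-sync for later resync
bsradm verify r0
```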
Replication traffic integrity checking
bsr can use cryptographic message digest algorithms to verify the integrity of replication traffic between nodes in real time.
When this feature is used, the Primary generates a message digest of every data block and passes it to the Secondary node, which verifies the integrity of the replication traffic; if the digests do not match, retransmission is requested. Through this check, bsr protects the source data against the following error conditions, which can otherwise cause undetected data corruption during replication:
Bit errors (bit flips) occurring in data transferred between the main memory and the network interface of the transmitting node (If the TCP checksum offload function provided by the latest LAN card is activated, these hardware bit flips may not be detected by software).
Bit errors that occur on data transferred from the network interface to the receiving node's main memory (the same consideration applies to TCP checksum offloading).
Bugs or race conditions in the network interface firmware and drivers.
Bit flips or random corruption injected by linked network components between nodes (if direct connections or back-to-back connections are not used).
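Traffic integrity checking would be enabled with a digest-algorithm setting such as the following sketch; the data-integrity-alg keyword is assumed from bsr's DRBD heritage and should be verified against the bsr manual:

```
# Hypothetical: verify replication traffic integrity for resource "r0"
resource r0 {
    net {
        data-integrity-alg crc32c;   # digest computed for every replicated block
    }
    ...
}
```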
Split-Brain notification and recovery
Split brain refers to a situation in which, after a temporary failure in which all network links between cluster nodes are disconnected, two or more nodes hold the Primary role due to intervention by cluster management software or an administrator. In this situation, each node may have modified data that was not replicated to the other side, creating two diverged data sets that may be impossible to merge.
bsr provides the functions to automatically detect and recover split brains. For more information on this, see the split brain topic in Troubleshooting.
Disk error handling strategies
In the event of a disk device failure, bsr's disk failure policy either simply passes the I/O error to the upper layer (usually the file system) for handling, or detaches the replication disk and stops it. The former is the passthrough policy, the latter the detach policy.
passthrough
When an error occurs in the lower disk layer, the error code is passed back to the upper (file system) layer without further processing; proper handling of the error is left to the upper layer. For example, the file system may see the error and retry the disk write, or attempt to remount read-only. By passing the error to the upper layer in this way, the file system itself recognizes the error and has an opportunity to cope with it.
detach
If the error policy is configured as detach, when an error occurs in the lower layer, bsr automatically detaches the disk. A detached disk becomes diskless and I/O to it is blocked, so the disk failure must be recognized and follow-up action taken. In bsr, the diskless state is defined as a state in which I/O is prevented from reaching the disk. More information is discussed in Troubleshooting's Disk Failure.
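A sketch of selecting the disk error policy; the passthrough and detach policy names come from this document, while the on-io-error keyword and its disk-section placement are assumed from bsr's DRBD heritage:

```
# Hypothetical disk error handling policy for resource "r0"
resource r0 {
    disk {
        on-io-error passthrough;   # pass errors to the file system; or: detach
    }
    ...
}
```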
Outdated data strategies
bsr distinguishes between Inconsistent data and Outdated data. Inconsistent data is data that cannot be accessed or used in any way; a typical example is the data on the target side during synchronization. Part of the target data is up to date and part is from the past, so it cannot be regarded as data from a single point in time. A file system on such a device cannot be mounted, nor can a file system check be run on it.
An Outdated disk status means data that is guaranteed to be consistent, but is not synchronized with the primary node's up-to-date data. This happens when the replication link goes down, either temporarily or permanently. Since disconnected Outdated data is data from a past point in time, bsr by default does not allow promoting a resource on a node with Outdated data, to prevent that data from being put into service. However, if necessary (in an emergency), you can forcibly promote Outdated data. In this regard, bsr provides an interface that allows an application to immediately mark a Secondary node as Outdated as soon as a network outage occurs. When the replication link of an Outdated resource is reconnected, the Outdated status flag is automatically cleared, background synchronization completes, and the data is updated to the latest (UpToDate).
A secondary node may have an Outdated disk status after a primary crash or a disconnected connection.