Working
- 1 Create Resource
- 2 Initialize Meta
- 3 Resource up
- 4 Promotion
- 5 Demotion
- 6 Resource down
- 6.1 demotion
- 6.2 disconnect
- 6.3 detach
- 6.4 Release resource
- 7 Reconfigurations
- 8 Delete resource
- 9 Inquiry
- 9.1 Version
- 9.2 Status Information
- 9.3 Events
- 10 Efficient synchronization
- 11 Adjusting the synchronization speed
- 12 Congestion mode
- 13 Disk flush
- 14 Consistency verification
- 14.1 Traffic integrity check
- 14.2 Online Verification
- 14.2.1 Enable
- 14.2.2 OV run
- 14.2.3 Automatic verification
- 15 Persist Role
- 16 One-way replication
This section describes the overall management tasks for a resource, from creation to deletion, and the main settings in the configuration file.
Create Resource
Resource creation is the preparation of the resource configuration file described in the previous section: once you have written a configuration file, the resource is considered created. In bsr this step must be performed manually by the user; no separate CLI or API is provided for it.
Once the resource has been created by writing the configuration file, it remains in the created state until the file is deleted; removing the configuration file is the only way to completely delete the resource on a node.
Initialize Meta
When a resource has been created, its meta disk must be initialized before the first start. The meta disk is initialized with the following command.
>bsradm create-md r0
initializing activity log
NOT initializing bitmap
Writing meta data...
New bsr meta data block successfully created.
Meta-initialization writes to the meta disk the additional information required for replication, and it only needs to be performed once, before the resource is started for the first time. Starting the resource without initializing the meta causes abnormal operation.
When meta-initialization is complete, the resource is ready to be started.
Resource up
You can start a resource with the bsradm up command. Internally, up starts the resource by performing the following steps in order.
Allocate resources such as memory and worker threads for the resource.
Load the replication volume and apply the options specified in the configuration file to the resource.
Connect to the peer node through the network.
>bsradm up r0
>bsradm status r0
r0 role:Secondary
disk:Inconsistent
node0 role:Secondary
peer-disk:Inconsistent
The status of bsr can be retrieved with the bsradm status command. For more information, refer to the Inquiry section.
Allocate resource
Allocates and initializes memory and worker threads for the resource.
Attach volume
The volume configured in the resource is attached as the replication volume; the information on the meta disk is read and loaded, and the options set in the configuration file are applied. Attaching can also be performed individually with the bsradm attach command.
Connect to peer node
Associates the attached volume with the peer node's resource volume. When the connection is established, the replication state becomes Established and the resource stands by to start replication. If the peer node has not yet prepared a connection, the local resource remains in the Connecting state. The replication connection can also be established individually with the bsradm connect command.
When all of the above steps have completed in order, the resource startup is considered successful. If one of the steps fails, startup may be interrupted; in that case, check the state of the resource and identify the problem through the bsr log and error messages.
Promotion
Resources can have the Primary or Secondary role. A resource in the Primary role can access the volume device for reading and writing without restriction, while a resource in the Secondary role completely blocks access to the volume device from the user layer and applies only the data received from the Primary to the device.
When a resource is started, its default role is Secondary; it can be switched to the Primary role by user command. This is called promotion.
>bsradm primary <resource>
Initial synchronization
When a resource is first started, the disks on both nodes are in the Inconsistent state, in which a disk cannot normally be promoted. The initial promotion of a resource is therefore forced, and after the forced promotion, initial synchronization is performed automatically. Forced promotion is performed as follows.
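For example, with the resource r0 used in the earlier examples:
>bsradm primary --force r0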
About mount operation
Mount behavior differs between Windows and Linux. On Linux, the volume must be mounted manually before it can be used, whereas on Windows mounting is performed automatically at the operating system's shell level, so no separate mount command is required. On Linux, therefore, an additional mount operation is required to use the volume after promotion.
bsr defaults to FastSync, which synchronizes only the areas used by the file system. However, if the file system of the replication volume is already damaged for some reason, FastSync cannot rely on that damaged file system information. To handle this situation, bsr performs an integrity check (fsck) of the file system before the initial synchronization, and if the file system is broken, the initial synchronization fails.
In this case, recover the file system manually and initialize the resource again.
Demotion
The transition from Primary role to Secondary role is called demotion.
Unmounting and demoting a resource is the heaviest task among bsr's command operations: switching the role to Secondary entails flushing all data pending replication to the target side. This is the basic mechanism for keeping data consistent between the replication source and the target, and it guarantees consistency between Primary and Secondary at the moment of demotion. Keep in mind, therefore, that unmounting and demoting involve a certain amount of latency while all pending data is written to the target.
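Assuming bsr follows the drbdadm-style command set it is derived from, a demotion looks like this:
>bsradm secondary r0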
Resource down
You can stop a resource with the bsradm down command. down stops the resource in the reverse order of the up process described above and, if the resource was promoted, demotes it first. In short, it performs resource demotion, replication disconnection, volume detach, and resource release, in that order.
demotion
If the resource was promoted, demote it first.
disconnect
Stops replication by disconnecting the connection. Disconnection can also be performed individually with the disconnect command.
If disconnection is attempted while synchronization or replication is in progress, the disconnect may be held up for a period of time. This is because the command to stop replication is delivered to both the local and peer nodes; if a large amount of data is already buffered for replication, delivery of the command can be delayed by the sequential processing structure. If you want to skip this delay, you can force a local disconnect with the --force option. A forced disconnect completes quickly, but all data pending replication or synchronization is discarded, so you should expect that out-of-sync (OOS) blocks may result.
detach
Detaches the volume that was attached as the replication volume and records the relevant information on the meta disk. Detach can also be performed as a separate command, but detaching the volume of a Primary resource is not allowed.
Release resource
Frees all memory and threads allocated for the resource.
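The whole sequence is triggered by a single command, as in the earlier examples:
>bsradm down r0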
Reconfigurations
bsr basically supports changing resource properties during operation (at runtime); this is called dynamic configuration. However, some essential properties do not support dynamic configuration and must be reconfigured statically: change the settings in the configuration file, then restart the resource to apply them. In other words, static configuration requires a resource restart.
Dynamic settings
Change the configuration file and apply the changes in real time with the bsradm adjust command. Most properties can be changed this way, except for a few special settings such as the replication protocol.
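For example, after editing the configuration file:
>bsradm adjust r0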
Static settings
If you need to change settings that are essential to the replication configuration (node ID, volume information, etc.), you must bring the resource down first and then change the settings. After changing the configuration file, bring the resource up again; the changed settings take effect when the resource restarts.
Full reconfigurations
If you need to completely change the configuration or recover from a disk failure, you must reconfigure the entire resource: first bring the running resource down, then change the configuration, re-initialize the meta, and restart the resource.
Re-initializing the meta disk requires redoing the initial synchronization of the volume.
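A typical full-reconfiguration sequence, using the commands introduced above (resource name r0 assumed):
>bsradm down r0
(edit the configuration file)
>bsradm create-md r0
>bsradm up r0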
Resizing volume
The volume of a configured resource may need to be grown or shrunk depending on operational circumstances, and this requires platform-specific procedures for resizing the replication volume. Only online growing of a volume is supported; shrinking a volume must follow the full reconfiguration procedure described above.
Windows
To grow the volume on both nodes during replication operation on Windows, you must first disconnect the replication and bring both nodes to Primary, because in the Secondary state the volume is locked by bsr and cannot be resized. Since both nodes are promoted to Primary, the replication cluster enters the split-brain state; after resizing the volumes, demote the node that was originally Secondary, and then resolve the split brain using that Secondary node as the victim.
This grows the entire volume and synchronizes, from the source, only the newly added area, enabling online growth. Naturally, the grown target volume must be at least as large as the source.
Linux
To grow a volume online on Linux, the following conditions must be met:
bsr's block device must be configured on a volume manager such as LVM.
The source and target nodes must remain connected over the mirror connection.
Put one node in the Primary state, increase the volume on both nodes through LVM, and then issue the following command on one node so that bsr recognizes the newly increased size.
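A sketch of the procedure; the lvextend size is a placeholder, and the resize subcommand is an assumption based on bsr's DRBD lineage:
lvextend -L +10G /dev/vg0/r0    # on both nodes: extend the backing LV
bsradm resize r0                # on one node: make bsr recognize the new size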
A new resync then runs over the increased area of the volume.
Delete resource
A resource is deleted by deleting its configuration file. In normal operation, resources are deleted through the following procedure.
Bring the running resource down.
On Windows, release the lock on the volume via bsrcon /m.
Delete the resource configuration file.
Inquiry
Version
Check the version information of bsr with the bsradm -V command.
Status Information
Print out basic status information.
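For example, as shown earlier:
>bsradm status r0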
Print detailed information.
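A detailed view, assuming bsr keeps the drbdsetup-style status options of its DRBD lineage:
>bsrsetup status r0 --verbose --statistics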
Print the network connection status.
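The connection state, assuming the drbdadm-style cstate subcommand:
>bsradm cstate r0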
Connection status and replication status are indicated separately. Until both nodes are connected, the connection status changes from StandAlone to Connecting. After the connection is established, it is maintained as Connected, and the replication status changes from Established to various states depending on the operation in progress.
The replication state has only one value at a time; in particular, if the local node is in a source state, the peer node must be in the corresponding target state.
Resources have one of the following roles:
Primary. It can be read and written. Only one node within a cluster can have this role.
Secondary. Disk changes are updated from the Primary node; the volume is neither readable nor writable. This role can be held by one or multiple nodes.
Unknown. The role of the resource is unknown. This value is used in disconnected mode to indicate the role of the peer node; it is never used for the local node.
Local and remote disks have one of the following states:
Diskless. No local block device is assigned to the bsr driver. This state occurs when the resource has never been attached to its backing device, when it was manually detached with the bsradm detach <resource> command, or when it was automatically detached after a lower-level I/O error.
Attaching. Transient state while reading metadata.
Failed. A transient state following an I/O failure report from the local block device. The next state is Diskless.
Negotiating. A transient state entered when an attach is executed on an already-connected device.
Inconsistent. The data is inconsistent. This state appears on the disks of both nodes immediately after a new resource is configured, and on the disk of a target node while synchronization is in progress.
Outdated. The data on the resource is consistent but out of date.
DUnknown. Used for the remote disk state when the network connection is unavailable.
Consistent. A transient state, while nodes are connecting, in which the data is considered consistent. Once the connection completes, it is determined whether the disk is UpToDate or Outdated.
UpToDate. The data is consistent and up to date. This is the normal state during replication.
Events
You can monitor real-time event state with the following command. The bsrsetup events2 command can be used with the '--statistics' and '--timestamp' options.
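For example:
>bsrsetup events2 --statistics --timestamp r0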
Efficient synchronization
bsr provides several features for efficient synchronization: FastSync, checksum-based synchronization, truck-based synchronization, and bitmap clear synchronization.
Fast Synchronization
bsr replaces the existing full-synchronization method, which covers the entire disk area, with FastSync, which synchronizes only the area used by the file system. For example, if only 100MB of a 1TB disk is in use, initial synchronization completes far faster than full synchronization because only the 100MB of used area is synchronized instead of the entire 1TB. FastSync operates at the following times.
Initial full synchronization (bsradm primary --force)
Manual full synchronization (invalidate/invalidate-remote)
Online Verify check (bsradm verify)
Checksum-based synchronization
Checksum-based synchronization can further improve the efficiency of bsr's synchronization algorithm. It reads each block before synchronizing it, computes a hash digest of the contents currently on disk, reads the same sector from the peer node, and compares the digests. If the hashes match, the re-write for that block is skipped; if they do not match, synchronization data is transmitted. This can outperform the existing method of simply overwriting the blocks to be synchronized: if the file system wrote the same contents to a sector again while disconnected, resynchronization for that sector is skipped, shortening the overall synchronization time.
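A configuration sketch; the csums-alg option name is an assumption carried over from bsr's DRBD lineage and should be verified against your version:
resource r0 {
  net {
    csums-alg crc32c;   # assumed option name; enables checksum-based resync
  }
}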
Truck-based synchronization
Truck-based synchronization, performed by directly transporting and configuring disks, is suitable for the following situations.
Initially, the amount of data to be synchronized is very large (hundreds of gigabytes or more)
When the rate of change of the data to be copied is expected to be small compared to the huge data size
When available network bandwidth between source and target sites is limited
In the situations above, initializing with the normal device synchronization method instead of truck-based synchronization will take a very long time.
Consider the following situation: a local node was disconnected while it was Primary. That is, the device configuration is complete, and an identical copy of bsr.conf exists on both nodes. The commands for initial resource promotion have been executed on the local node, but the remote node is not connected yet.
Run the following command on the local node.
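>bsradm new-current-uuid --clear-bitmap <resource>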
Create copies of the data to be replicated and of its metadata. For example, you could pull a hot-swappable drive out of a RAID-1 mirror. In that case, the RAID set needs a replacement drive so that it can continue mirroring; the removed drive, however, is a literal copy that can be used elsewhere. If your local block device supports snapshot copies, you can use a snapshot instead.
Run the following command on the local node. This second command run has no --clear-bitmap option.
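>bsradm new-current-uuid <resource>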
Make a physically identical copy of the original data to be used directly on the remote node.
You can physically connect the copied disk directly, or copy the transported data bit-for-bit onto the existing disk. This must be done not only for the mirrored data but also for the metadata. If such a procedure is not acceptable, this method cannot be used.
Start the bsr resource on the remote node.
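For example:
>bsradm up <resource>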
When both nodes are connected, they will not initiate a full device synchronization. Instead, only the blocks that have changed since the bsradm new-current-uuid --clear-bitmap command was invoked are synchronized automatically.
Even if there were no changes, a brief synchronization may occur covering the area of the Activity Log rolled back on the new Secondary node.
Bitmap clear synchronization
You can use the bitmap clear option (--clear-bitmap) to synchronize quickly, skipping the lengthy initial full synchronization. The following are examples of this operation.
It can be used to skip the initial sync by creating a new Current UUID and clearing the Bitmap UUID. This use case only works with freshly created metadata.
On both nodes, initialize the meta and configure the device.
>bsradm -- --force create-md res
Start the resource on both nodes; they recognize each other's volume size at the initial handshake.
>bsradm up res
When both nodes are connected as Secondary/Secondary and Inconsistent/Inconsistent, create a new UUID and clear the bitmap.
>bsradm new-current-uuid --clear-bitmap res
Both nodes are now in the Secondary/Secondary, UpToDate/UpToDate state. Promote one side to Primary and create a file system.
>bsradm primary res
>mkfs -t <fs-type> $(bsradm sh-dev res)
One obvious side effect of this approach is that the replica is full of old garbage (unless you make the two sides identical by other means), so online verification is expected to find many unsynchronized blocks. This method must never be used when the volume already holds data: at first glance it may appear to work, but once you switch over to the other node, the data that was already there has not been replicated, so the data is broken.
Adjusting the synchronization speed
When synchronization is running in the background, the data on the target is temporarily in an inconsistent state. This inconsistent period should be kept as short as possible to ensure consistency, so a sufficiently high synchronization rate is beneficial. However, replication and synchronization share the same network bandwidth: if the synchronization bandwidth is set high, replication is given relatively little, and lowering the replication bandwidth increases local I/O latency and degrades local I/O performance.
Because either replication or synchronization unilaterally occupying the bandwidth affects the other, bsr implements variable-rate synchronization, which guarantees as much replication bandwidth as possible while moderating the synchronization rate according to the replication situation; this is the default policy. In contrast, the fixed-rate synchronization policy, which guarantees the synchronization bandwidth at all times regardless of replication, can degrade local I/O performance if used during service operation, so it is generally not recommended and should be used only in special situations.
Fixed rate synchronization
The maximum bandwidth used for background resynchronization is determined by the resource's resync-rate option. This option is included in the disk section of the resource configuration in /etc/bsr.conf as follows.
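A sketch of such a disk section, assuming bsr's DRBD-style configuration syntax (values are illustrative):
resource r0 {
  disk {
    resync-rate 4096;   # fixed resync rate; 4096 KiB = 4 MiB per second
  }
}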
The resync-rate and c-min-rate settings specify rates per second; the default unit is KiB, so a value of 4096 is interpreted as 4 MiB per second.
Variable rate synchronization
Fixed-rate synchronization is problematic for configurations in which multiple resources share a replication/synchronization network. Because the resources share the same network, if the synchronization rate of one replication resource occupies the channel, the other resources cannot be guaranteed their fixed synchronization rate. In this case, variable-rate synchronization can dynamically adjust the synchronization rate of each replication channel, proactively yielding synchronization bandwidth in response to other resources taking it over. Variable-rate synchronization determines an initial synchronization rate (resync-rate) and then continuously adjusts it with an automatic control algorithm. This algorithm keeps the synchronization rate between c-min-rate and c-max-rate while still allowing replication to operate in the foreground. Setting c-max-rate too high will cut into the replication bandwidth, so it is preferable to set it to match the network bandwidth.
The optimal configuration for variable bandwidth synchronization depends on the available network bandwidth, application I/O patterns, and replication link congestion, and the optimal configuration settings may vary depending on whether you are using Replication Accelerator (DRX).
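A sketch using the options named above (values are illustrative; syntax assumes bsr's DRBD-style configuration):
resource r0 {
  disk {
    resync-rate 4096;    # initial synchronization rate
    c-min-rate 512;      # guaranteed minimum
    c-max-rate 102400;   # upper bound; match this to the network bandwidth
  }
}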
Set the synchronization ratio
You can also set the synchronization rate as a percentage of the replication bandwidth.
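An illustrative snippet; the option name resync-ratio is hypothetical (check the man page of your bsr version), and only the 3:1 semantics below come from this guide:
disk {
  resync-ratio "3:1";   # hypothetical option name: 3 parts replication to 1 part synchronization
  c-min-rate 512;       # the higher of the ratio result and c-min-rate is applied
}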
The example above sets the synchronization bandwidth to a ratio of 3 parts replication to 1 part synchronization (4 in total). However, the resulting sync rate is compared against c-min-rate, and if c-min-rate is higher, the c-min-rate value is applied instead. This guarantees a minimum amount of synchronization bandwidth.
Congestion mode
In environments where the replication bandwidth is variable (WAN), the replication link can sometimes become congested. This causes the Primary node's I/O to wait, degrading local I/O performance. Congestion mode is a configuration for responding to this situation.
When congestion is detected, replication is suspended and the buffered data is sent slowly to the target while local I/O is recorded as OOS. During this process, the Primary is in the Ahead data state relative to the Secondary; once it finishes sending the buffered data, it automatically enters sync mode to synchronize the OOS areas that could not be replicated.
Here is an example of setting up a congestion policy.
In the resource configuration file, set the congestion mode with the on-congestion option and the congestion detection threshold with the congestion-fill option.
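A sketch following the recommendations below (sizes are placeholders; the net-section placement assumes bsr's DRBD-style configuration):
resource r0 {
  net {
    sndbuf-size 1G;
    on-congestion pull-ahead;
    congestion-fill 900M;         # ~90% of sndbuf-size
    congestion-extents 1115;      # ~90% of the al-extents setting
    congestion-highwater 20000;   # packet-count threshold (default)
  }
}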
The pull-ahead option is used together with congestion-fill, congestion-extents, or congestion-highwater. The recommended values for each property are as follows.
Set congestion-fill to approximately 90% of the size of sndbuf-size. If you are integrating a replication accelerator (DRX), set it to 90% of the DRX buffer. However, if the buffer is allocated a large size, say 10GB or more, the 90% threshold may be too large to detect congestion, so this should be adjusted to a reasonable value through tuning.
The recommended value for congestion-extents is 90% of the al-extents setting.
congestion-highwater detects congestion based on packet count, which makes it suitable for DR environments where capacity-based detection of congestion is not appropriate. It is enabled by default with a value of 20000; setting it to 0 disables it, and its maximum value is 1000000.
Disk flush
If the target node suddenly goes down due to a power failure during replication, data may be lost if the disk cache is not backed by a battery (BBWC). To prevent this in advance, when writing data to the target's disk, bsr always performs a flush operation so that the data is forced out of the cache and onto the media.
Storage devices equipped with BBWC do not need the disk flush operation, so bsr provides an option to disable flushing, as follows.
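A sketch, assuming bsr keeps the DRBD-style flush options (verify the option names against your version):
resource r0 {
  disk {
    disk-flushes no;   # assumed option: disables flushing of the replicated volume
    md-flushes no;     # assumed option: disables flushing of the meta disk
  }
}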
You should disable device flushing only when running bsr on devices with a battery-backed write cache (BBWC). Most storage controllers automatically disable the write cache when the battery is exhausted and switch to write-through mode.
Consistency verification
Consistency verification checks that the source and target data match completely, either by verifying replication traffic block-by-block in real time during replication, or by comparing whole (used) disk volumes block-by-block based on hash digests.
Traffic integrity check
bsr can use cryptographic message digest algorithms to verify message integrity between the two nodes. When this function is used, bsr generates a message digest of every data block and delivers it to the peer node, which uses it to verify the integrity of the replication packet. If a digested block does not match, retransmission is requested.
Through this consistency check, bsr can protect the source data against the following error conditions, any of which could cause data corruption during replication if left unhandled.
Bit errors (bit flips) that occur in data transferred between main memory and the network interface of the transmitting node.
(If the TCP checksum offload function provided by the LAN card is enabled, these hardware bit flips may not be detected in software.)
Bit errors that occur on data being transferred from the network interface to the receiving node's main memory (the same applies for TCP checksum offloading).
Damage due to a bug or race condition within the network interface firmware or driver.
Bit flips or random damage injected by intermediate network components between the nodes (when a direct, back-to-back connection is not used).
Replication traffic consistency checking is disabled by default. To enable it, add the following to the resource configuration in /etc/bsr.conf.
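A sketch, assuming bsr keeps the data-integrity-alg option name of its DRBD lineage:
resource r0 {
  net {
    data-integrity-alg crc32c;   # <algorithm>; on Windows only crc32c is supported
  }
}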
<algorithm> is a message hashing algorithm supported by the kernel cryptography API in the system's kernel configuration. On Windows, only crc32c is supported.
After changing the resource configuration of both nodes identically, execute bsradm adjust <resource> on both nodes to apply the changes.
Online Verification
Online verification is a function for checking the consistency of block data between nodes while the service is online. It does not check blocks redundantly, so it uses network bandwidth efficiently, and by default it checks only the area used by the file system.
Online verification sequentially hashes all data blocks on the storage of a specific resource on one node (the verification source) and sends the digests to the verification target, which hashes the contents of the same block locations and compares them. If a digest does not match, the block is marked out-of-sync and is synchronized later. Because only small digests are transmitted, not the entire contents of the blocks, network bandwidth is used effectively.
Because verification runs while the resource is operating, there may be a slight decrease in replication performance when online verification and replication are performed simultaneously. However, the service does not need to be stopped, and there is no system downtime during the scan or during the synchronization that follows it.
It is common practice to register online verification as a scheduled task in the OS and run it periodically during periods of low operational I/O load.
Enable
Online verification is disabled by default; it can be enabled by adding the following entry to the resource configuration in bsr.conf.
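A sketch, assuming bsr keeps the verify-alg option name of its DRBD lineage:
resource r0 {
  net {
    verify-alg crc32c;   # <algorithm>; on Windows only crc32c is supported
  }
}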
algorithm is the message hashing algorithm; on Windows, only crc32c is supported.
To enable online verification, make the same resource configuration changes on both nodes, then run bsradm adjust <resource> on both nodes to apply the changes.
OV run
After enabling online verification, you can run the test using the following command:
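>bsradm verify <resource>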
When online verification is executed, bsr finds, displays, and records the unsynchronized blocks in <resource>. While it runs, all applications using the device can operate without restriction, and the role of the resource can also be changed.
The verify command changes the disk state to UpToDate before performing verification. It should therefore be run on the replication-source node after the initial sync has completed; if you run verification on a node whose disk is in the Inconsistent state, that disk is changed to UpToDate, which can cause operational problems.
If out-of-sync blocks are detected while verification is running, you can synchronize them after verification completes with the following command. The direction of synchronization is from the Primary node to the Secondary node; synchronization is not performed in the Secondary/Secondary state. Therefore, to resolve OOS found by online verification, the source-side node must be promoted to Primary.
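In DRBD-lineage tools, the blocks marked out-of-sync by verify are resynchronized by disconnecting and reconnecting the resource; assuming bsr follows the same scheme:
>bsradm disconnect <resource>
>bsradm connect <resource>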
Automatic verification
If you need regular verification, register the bsradm verify <resource> command with the task scheduler as follows.
First, create a script file with the following contents in a specific location on one of the nodes.
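For example, on Windows, a minimal batch script (the location and resource name are placeholders):
@echo off
bsradm verify <resource>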
To verify all resources, you can use the all keyword instead of <resource>.
The following is an example of creating a scheduled task using schtasks (the Windows scheduled-task command). With the following settings, online verification is performed every Sunday at 00:42.
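For example (the task name and script path are placeholders):
>schtasks /Create /TN bsr_verify /TR C:\bsr\verify.bat /SC WEEKLY /D SUN /ST 00:42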
Persist Role
While resource roles can change with operational circumstances, sometimes you may want a role to persist. (bsr 1.7.3 and later)
A resource with persist-role set keeps the resource role that was last explicitly specified (with the bsradm command) when it restarts. This works in any situation where the replication service or the system reboots and the resource restarts with it.
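A sketch, assuming the resource options section syntax (see also One-way replication below):
resource r0 {
  options {
    persist-role yes;   # assumed value syntax; keeps the explicitly set role across restarts
  }
}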
One-way replication
If you always want one-way replication from the primary node to the standby node, with no switchover or failover, consider the target-only attribute on the standby node side. (bsr 1.7.3 and later)
Set the persist-role attribute described above in the resource options section to fix the roles of the primary and standby nodes.
Set the target-only attribute on the standby node side to force the replication/synchronization direction from the primary node to the standby node only.
A target-only node is prohibited from acting as a source in any replication or synchronization operation, including explicit commands, and can have only the target role. Manual synchronization or promotion commands that would make it act as a source are blocked (however, promotion is allowed while disconnected).
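A sketch of a one-way configuration on the standby node; the option placement and values are assumptions based on the description above:
resource r0 {
  options {
    persist-role yes;
    target-only yes;   # set on the standby node only
  }
}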