This page describes the overall resource management tasks, from creating to deleting resources, and the main settings in the configuration file.

Create Resource

Creating a resource means preparing the resource configuration file described in the previous section. Once the configuration file exists, the resource is considered created. In bsr, the user must perform this step manually; no separate CLI or API is provided for it.

A resource stays in the created state until its configuration file is deleted. Deleting the configuration file is the only way to completely remove the resource from a node.

Initialize Meta

After the resource is created, the meta disk must be initialized before the first start. Initialize the meta disk with the following command.

>bsradm create-md r0
initializing activity log
NOT initializing bitmap
Writing meta data...
New bsr meta data block successfully created.

Meta-initialization writes the additional information needed for replication to the meta disk. It needs to be performed only once, before the resource is started for the first time. Starting a resource without initializing the meta causes abnormal operation.

When meta-initialization is complete, the resource is ready to be started.

Resource up

You can start a resource with the bsradm up command. Internally, up performs the following steps in sequence.

  • Allocate the memory and worker threads needed by the resource.

  • Attach the replication volume and apply the options specified in the configuration file to the resource.

  • Connect to the peer node over the network.

>bsradm up r0
>bsradm status r0
r0 role:Secondary
  disk:Inconsistent
  node0 role:Secondary
    peer-disk:Inconsistent

The status of bsr can be retrieved with the bsradm status command. For details, see the Inquiry section.

Allocate resource

Allocate and initialize memory and worker threads for resources.

Attach volume

The volume configured in the resource is attached as the replication volume, the meta disk information is read and loaded, and the options set in the configuration file are applied. Attaching can also be performed on its own with the bsradm attach command.

Connect to peer node

Associate the attached volume with the peer node's resource volume. When the connection is established, the replication status becomes Established and the resource is ready to start replication. If the peer node is not yet ready to connect, the local resource remains in the Connecting state. The connection can also be made on its own with the bsradm connect command.

Resource startup succeeds when all of the steps above complete in order. If any step fails, startup may be interrupted. In that case, check the status of the resource and use the bsr log and error messages to identify the problem.

Promotion

A resource has either the Primary or the Secondary role. A resource in the Primary role can read and write the volume device without restriction. A resource in the Secondary role blocks all access to the volume device from the user layer and only applies the data received from the Primary to the device.

When a resource is started, its default role is Secondary. It can be switched to the Primary role by a user command; this is called promotion.

bsradm primary <resource>

Initial synchronization

When a resource is first started, the disks on both nodes are in the Inconsistent state, and a disk in this state cannot normally be promoted. The first promotion of a resource therefore uses the force option, and the initial synchronization starts automatically after the forced promotion. Forced promotion is performed as follows.

bsradm primary --force <resource>

About mount operation

Mount behavior differs between Windows and Linux. On Linux, the volume must be mounted manually before it can be used. On Windows, the operating system mounts the volume automatically at the shell level, so no separate mount command is required. Linux therefore requires an additional mount step after promotion.

Demotion

The transition from Primary role to Secondary role is called demotion.

bsradm secondary <resource>

On Linux, the volume must be unmounted before the resource is demoted. On Windows, no separate unmount step is needed because the unmount is performed internally.

Unmounting and demotion together are the heaviest of the bsr command operations: the role is switched to Secondary, and all data still pending replication is applied on the target. This is how bsr keeps the data on the replication source and target consistent, and it guarantees that the Primary and Secondary hold consistent data at the time of demotion. Expect some latency during unmounting and demotion while all pending data is applied on the target.

Manual failover

The procedure for a manual failover is as follows.

  1. Stop all applications or services using the bsr device on the primary node, and demote the resource to Secondary (on Linux, unmount the volume first).

    bsradm secondary <resource>
  2. Run the following command on the node you want to promote to Primary, and then restart the service (on Linux, mount the volume first).

    bsradm primary <resource>
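
The ordering of the failover steps can be written down as a small Python sketch. It only builds the list of actions and does not run anything. The function name failover_plan, the node names, the device path /dev/bsr0 and the mount point are hypothetical placeholders; only the bsradm secondary and bsradm primary commands come from the procedure above.

```python
from typing import List, Optional, Tuple


def failover_plan(resource: str, old_primary: str, new_primary: str,
                  linux_mount: Optional[Tuple[str, str]] = None) -> List[Tuple[str, str]]:
    """Return (node, action) pairs in the order they should be carried out.

    linux_mount is a hypothetical (device, mount_point) pair. Pass None on
    Windows, where no explicit mount or unmount step is needed.
    """
    steps = [(old_primary, "stop applications using the bsr device")]
    if linux_mount:
        # On Linux the volume is unmounted before demotion.
        steps.append((old_primary, f"umount {linux_mount[1]}"))
    steps.append((old_primary, f"bsradm secondary {resource}"))
    steps.append((new_primary, f"bsradm primary {resource}"))
    if linux_mount:
        # On Linux the volume is mounted after promotion.
        steps.append((new_primary, f"mount {linux_mount[0]} {linux_mount[1]}"))
    steps.append((new_primary, "restart applications"))
    return steps


if __name__ == "__main__":
    for node, action in failover_plan("r0", "node-a", "node-b", ("/dev/bsr0", "/mnt/data")):
        print(f"{node}: {action}")
```

Keeping the plan as data makes it easy to review the order (demote before promote, unmount before demote) before anything is executed.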

Resource down

You can stop a resource with the bsradm down command. down performs the steps of the up process described above in reverse order; if the resource is in the promoted state, it is demoted first. In short, the order is resource demotion, replication disconnection, volume detach, and resource release.

On Linux, the volume must be unmounted first.

bsradm down <resource>

Demotion

If the resource was promoted, demote it first.

Disconnect

Replication is stopped by closing the connection. Disconnection can also be performed on its own with the disconnect command.

If you disconnect while synchronization or replication is in progress, the disconnect may take some time to complete. The disconnect command is delivered to both the local and the peer node, so if a large amount of data is already buffered for replication, delivery of the command is delayed because requests are processed sequentially. To avoid this delay, you can force a local disconnect with the --force option. A forced disconnect completes quickly, but all data pending replication or synchronization is discarded, so out-of-sync (OOS) data can result.

Detach

Detach the volume that was attached as the replication volume and record the relevant information on the meta disk. Detach can be run as a separate command, but detaching the volume of a Primary resource is not allowed.

bsr considers detaching the volume of a Primary resource a dangerous operation because it leads to a failure, so this operation has been disabled at the code level. Detaching the volume of a Secondary resource is allowed.

Release resource

Free all memory and threads allocated for the resource.

Reconfigurations

bsr generally supports changing the properties of a resource while it is in operation (at runtime). This is called a dynamic setting. Some essential properties, however, cannot be changed dynamically. They are static settings: change them in the configuration file and then restart the resource to apply them.

Dynamic settings

Change the configuration file and apply the changes at runtime with the bsradm adjust command. Most properties can be changed this way; a few special settings, such as the replication protocol, are exceptions.

Change replication protocol

To change the replication protocol during operation, change the protocol, the send buffer, and the congestion control settings together.

  • First, delete the peer connection with the bsrsetup del-peer <resource> <node-id> command.

  • Adjust the protocol, the congestion control settings, and sndbuf-size in the resource files on both nodes.

  • Apply the changes with bsradm adjust <resource>.

Static settings

To change settings that are essential to the replication configuration (node ID, volume information, and so on), first bring the resource down. Then change the configuration file and bring the resource up again; the changed settings take effect when the resource restarts.

Full reconfigurations

  • If you need to change the configuration completely or recover from a disk failure, you must reconfigure the entire resource. Bring the running resource down, change the configuration, reinitialize the meta, and then restart the resource.

On Windows, it may be necessary to release the lock on the volume during the rebuild. You can release the volume lock with the /m option of the bsrcon utility.

  • Initializing the meta disk means the initial synchronization of the volume must be performed again.

Resizing volume

The volume of a configured resource may need to be expanded or shrunk depending on operational needs. Resizing a replication volume requires a separate procedure that varies by platform. Only online growing of a volume is supported; to shrink a volume, follow the full reconfiguration procedure.

Windows

To grow the volumes on both nodes during replication on Windows, first disconnect the replication and promote both nodes to Primary. A volume cannot be resized in the Secondary state because bsr locks it. With both nodes promoted to Primary, the replication cluster enters the split-brain state. After resizing the volumes, demote the node that was originally Secondary, and then resolve the split-brain with that node as the victim.

This grows the whole volume, and the source synchronizes only the newly added area, which makes online growing possible. The grown target volume must be at least as large as the source.

As the volume grows, the meta disk grows automatically (bsr handles this internally when the volume is expanded). If there is not enough free space, the volume expansion fails. Take this into account when calculating the meta disk size during the initial resource configuration.

Linux

To perform online growing volume on Linux, the following conditions must be met:

  • bsr's block device must be configured with a volume manager such as LVM.

  • The source and target nodes must remain connected.

Put a node in the Primary state, grow the volumes on both nodes through LVM, and then run the following command on one node so that bsr recognizes the new size.

bsradm resize <resource>

A resync then runs for the newly added area of the volume.

Delete resource

A resource is deleted by deleting its configuration file. In normal operation, delete a resource with the following procedure.

  • Bring the running resource down.

    • On Windows, release the lock on the volume with bsrcon /m.

  • Delete the resource configuration file.

Inquiry

Version

Check the version information of bsr with the bsradm -V command.

[root@bsr-01 nglee]# bsradm -V
BSRADM_BUILDTAG=GIT-hash:3dca67e82d331e95121288a57898fcda13357e94 build by nglee@NGLEE-1,2020-01-29 13:50:48
BSRADM_API_VERSION=2
BSR_KERNEL_VERSION_CODE=0x000000
BSR_KERNEL_VERSION=0.0.0
BSRADM_VERSION_CODE=0x010600
BSRADM_VERSION=1.6.0-PREALPHA3

Status Information

Print out basic status information.

>bsradm status r0
r0 role:Secondary
  disk:UpToDate
  nina role:Secondary
    disk:UpToDate
  nino role:Secondary
    disk:UpToDate
  nono connection:Connecting

Print detailed information.

C:\>bsrsetup status r0 --verbose --statistic
r0 node-id:0 role:Secondary suspended:no
    write-ordering:flush
  volume:0 minor:2 disk:Inconsistent
      size:4096000 read:0 written:0 al-writes:0 bm-writes:0 upper-pending:0
      lower-pending:0 al-suspended:no blocked:no
  WIN2012R2_2 node-id:1 connection:Connected role:Secondary congested:no
    volume:0 replication:Established peer-disk:Inconsistent
        resync-suspended:no
        received:0 sent:0 out-of-sync:0 pending:0 unacked:0

Performance indicator

  • sent (network send). The amount of data sent to the peer node over the network connection, in KiB.

  • received (network receive). The amount of data received from the peer node over the network connection, in KiB.

  • written (disk write). The net amount of data written to the local disk, in KiB.

  • read (disk read). The net amount of data read from the local disk, in KiB.

  • al-writes (activity log). The number of updates to the activity log area of the metadata.

  • bm-writes (bitmap). The number of updates to the bitmap area of the metadata.

  • upper-pending (application pending I/O). The number of I/O requests passed to bsr from the upper layer that bsr is still processing.

  • lower-pending (subsystem open count). The number of open (not yet closed) requests that bsr has issued to the local I/O subsystem.

  • pending. The number of requests sent from the local node to the peer node that have not yet been acknowledged.

  • unacked (unacknowledged). The number of requests that were received by the peer node but have not yet been acknowledged.

  • write-ordering (write order). The disk write method currently in use.

  • out-of-sync. The amount of storage that is currently out of sync, in KiB.

  • resync-suspended. Whether resynchronization is suspended. Possible values are no, user, peer, and dependency.

  • blocked. Local I/O congestion status.

    • no: no congestion

    • upper: congestion on the upper device

    • lower: congestion on the lower disk

  • congested. Indicates whether the TCP send buffer of the replication connection is more than 80% full.

    • yes: congested

    • no: not congested
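
Because the counters are printed as key:value tokens, they are straightforward to read from a script. The following Python sketch is one possible way to do that, assuming the token layout shown in the verbose status output above. The helper names parse_fields and out_of_sync_mib are hypothetical, and the sample line uses made-up numbers.

```python
def parse_fields(line: str) -> dict:
    """Split a status line into {key: value} for every 'key:value' token."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition(":")
        if sep:  # skip tokens without a colon, such as a bare peer name
            fields[key] = value
    return fields


def out_of_sync_mib(line: str) -> float:
    """Return the out-of-sync amount in MiB (the counter is reported in KiB)."""
    return int(parse_fields(line)["out-of-sync"]) / 1024


if __name__ == "__main__":
    # Illustrative line in the same layout as the statistics output.
    sample = "received:0 sent:0 out-of-sync:2048 pending:0 unacked:0"
    print(parse_fields(sample))
    print(out_of_sync_mib(sample), "MiB out of sync")
```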

Print the network connection status.

C:\>bsradm cstate r0
Connected

Connection status

  • StandAlone. No network configuration is available. The resource has not yet been connected, has been disconnected by the user with bsradm disconnect, or has dropped its connection for a reason such as an authentication failure or split-brain.

  • Disconnecting. This is a temporary state while the connection is being closed. Next status: StandAlone

  • Unconnected. This is a temporary state before trying to connect. Next status: Connecting or Connected.

  • Timeout. This is a temporary state due to the communication timeout with the other node. Next status: Unconnected

  • BrokenPipe. This status is displayed temporarily after the connection with the other node is disconnected. Next status: Unconnected

  • NetworkFailure. This status is displayed temporarily after the connection with the other node is disconnected. Next status: Unconnected

  • ProtocolError. This status is displayed temporarily after the connection with the other node is disconnected. Next status: Unconnected

  • TearDown. This is a temporary state indicating that the other node is ending the connection. Next status: Unconnected

  • Connecting. The node is waiting for the peer node to become visible on the network.

  • Connected. TCP connection established, waiting for the first network packet from the other node.
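
The "Next status" notes in the list above can be collected into a lookup table, which is handy when checking a log of connection states. This is a minimal Python sketch built only from the transitions listed here; NEXT_STATUS and the two helper functions are hypothetical names, and states without a documented next status (StandAlone, Connecting, Connected) are simply absent from the table.

```python
# Transitions taken from the "Next status" notes in the list above.
NEXT_STATUS = {
    "Disconnecting": {"StandAlone"},
    "Unconnected": {"Connecting", "Connected"},
    "Timeout": {"Unconnected"},
    "BrokenPipe": {"Unconnected"},
    "NetworkFailure": {"Unconnected"},
    "ProtocolError": {"Unconnected"},
    "TearDown": {"Unconnected"},
}


def is_transient(state: str) -> bool:
    """True if the list above describes the state as temporary."""
    return state in NEXT_STATUS


def is_documented_transition(old: str, new: str) -> bool:
    """True if old -> new appears among the documented next statuses."""
    return new in NEXT_STATUS.get(old, set())


if __name__ == "__main__":
    print(is_transient("Timeout"))
    print(is_documented_transition("Timeout", "Unconnected"))
    print(is_documented_transition("Timeout", "Connected"))
```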

Replication status

  • Off. The connection with the other node is disconnected, or replication is not in progress.

  • Established. The connection is established normally and data mirroring is enabled.

  • StartingSyncS. The local node is the source, and full synchronization has been initiated by the user. Next status: SyncSource or PausedSyncS

  • StartingSyncT. The local node is the target, and full synchronization has been started by the user. Next status: WFSyncUUID

  • WFBitMapS. Partial synchronization begins. Next status: SyncSource or PausedSyncS

  • WFBitMapT. Partial synchronization begins. Next status: WFSyncUUID

  • SyncSource. The local node is the source and synchronization is in progress.

  • SyncTarget. The local node is the target and synchronization is in progress.

  • VerifyS. The local node is the source, and on-line device verification is running.

  • VerifyT. The local node is the target, and on-line device verification is running.

  • PausedSyncS. The local node is the source, and synchronization is paused, either because it depends on another synchronization operation completing or because of a manual command (bsradm pause-sync).

  • PausedSyncT. The local node is the target, and synchronization is paused, either because it depends on another synchronization operation completing or because of a manual command (bsradm pause-sync).

  • Ahead. The local node has reached network congestion and cannot send replication data. (It sends OOS information to the peer node.)

  • Behind. The peer node has reached network congestion, and replication data cannot be received. (The state later switches to SyncTarget.)

Connection status and replication status are reported separately. The connection status moves from StandAlone to Connecting until both nodes are connected. Once the connection is established, the connection status stays Connected, and the replication status changes from Established to other states depending on the operation in progress.

A resource is in only one replication state at a time. In particular, if one node is in a source state, the peer node must be in the corresponding target state.
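
As a rough illustration of the source/target pairing, the following Python sketch maps each source-side replication state to the target-side state expected on the peer. The pairs are inferred from the state names and descriptions in the list above; treating Ahead and Behind as a pair is an assumption, and expected_peer_state is a hypothetical helper, not part of bsr.

```python
from typing import Optional

# Source-side state -> target-side state (inferred from the state list).
SOURCE_TO_TARGET = {
    "StartingSyncS": "StartingSyncT",
    "WFBitMapS": "WFBitMapT",
    "SyncSource": "SyncTarget",
    "VerifyS": "VerifyT",
    "PausedSyncS": "PausedSyncT",
    "Ahead": "Behind",  # assumption: congestion states treated as a pair
}
TARGET_TO_SOURCE = {target: source for source, target in SOURCE_TO_TARGET.items()}


def expected_peer_state(local_state: str) -> Optional[str]:
    """Return the state expected on the peer, or None if no pairing is known."""
    if local_state in SOURCE_TO_TARGET:
        return SOURCE_TO_TARGET[local_state]
    return TARGET_TO_SOURCE.get(local_state)


if __name__ == "__main__":
    print(expected_peer_state("SyncSource"))
    print(expected_peer_state("VerifyT"))
```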

The role of a resource can be checked as follows.

C:\Program Files\bsr>bsradm role r0
Primary/Secondary 

Resources have one of the following roles:

  • Primary. The resource can be read and written. Only one node in a cluster can have this role.

  • Secondary. The resource receives disk updates from the primary node and cannot be read or written. One or more nodes can have this role.

  • Unknown. The role of the resource is unknown. It is used for the peer node in disconnected mode and never for the local node.

The disk status can be checked as follows.

C:\Program Files\bsr>bsradm dstate r0
UpToDate/UpToDate

Local and remote disks have one of the following values:

  • Diskless. No local block device is assigned to the bsr driver. The resource has never been attached to its backing device, has been detached manually with the bsradm detach <resource> command, or has been detached automatically after a lower-level I/O error.

  • Attaching. A transient state while the metadata is being read.

  • Failed. A temporary state after the local block device reports an I/O failure. The next state is Diskless.

  • Negotiating. A temporary state when attach is run on a device that is already connected.

  • Inconsistent. The data is inconsistent. The disks on both nodes are in this state when a new resource has just been configured. It is also the disk state of the target node during synchronization.

  • Outdated. The data is consistent, but it is out of date.

  • DUnknown. Used to display the status of the remote disk when no network connection is available.

  • Consistent. A transient state during the connection of nodes in which the data is considered consistent. When the connection completes, the state is resolved to UpToDate or Outdated.

  • UpToDate. The data is consistent and up to date. This is the normal state during replication.

bsr distinguishes between Inconsistent data and Outdated data. Inconsistent data is data that cannot be accessed or used in any way. A typical example is the data on the target during synchronization: part of it is up to date and part of it is old, so it does not represent the state of the volume at any single point in time. If the device holds a file system in this state, the file system cannot be mounted or even checked.

Outdated data is guaranteed to be consistent as of a specific point in time, but it is no longer in sync with the latest data on the primary node. This happens when the replication link goes down, temporarily or permanently. Outdated data on a disconnected node can be used without problems, but it is data from the past. To prevent such data from being put into service, bsr does not allow a resource with outdated data to be promoted by default. If necessary (in an emergency), you can force promotion of outdated data.
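
The promotion rules described on this page can be summarized in a few lines of Python. This is only a sketch of the rules as stated here: UpToDate disks can be promoted, while Inconsistent and Outdated disks need the force option. Treating every other disk state as not promotable is a conservative assumption of this sketch, and can_promote is a hypothetical helper rather than a bsr API.

```python
def can_promote(disk_state: str, force: bool = False) -> bool:
    """Decide whether promotion would be accepted for the given local disk state."""
    if disk_state == "UpToDate":
        return True
    if disk_state in ("Inconsistent", "Outdated"):
        # Allowed only as a forced promotion (bsradm primary --force).
        return force
    # Assumption: other states (Diskless, Attaching, ...) are treated as not promotable.
    return False


if __name__ == "__main__":
    for state in ("UpToDate", "Outdated", "Inconsistent", "Diskless"):
        print(state, can_promote(state), can_promote(state, force=True))
```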

Events

You can check events in real time with the following command. The bsrsetup events2 command accepts the --statistics and --timestamp options.

C:\Program Files\bsr\bin>bsrsetup events2 --now r0
exists resource name:r0 role:Secondary suspended:no
exists connection name:r0 peer-node-id:1 conn-name:remote-host connection:Connected role:Secondary
exists device name:r0 volume:0 minor:7 disk:UpToDate
exists device name:r0 volume:1 minor:8 disk:UpToDate
exists peer-device name:r0 peer-node-id:1 conn-name:remote-host volume:0
replication:Established peer-disk:UpToDate resync-suspended:no
exists peer-device name:r0 peer-node-id:1 conn-name:remote-host volume:1
replication:Established peer-disk:UpToDate resync-suspended:no
exists -

Efficient synchronization

bsr provides several functions for efficient synchronization: FastSync, checksum-based synchronization, truck-based synchronization, and bitmap clear synchronization.

Fast Synchronization

bsr improves on the conventional full synchronization of the entire disk with Fast Synchronization (FastSync), which synchronizes only the area used by the file system. For FastSync, bsr collects the file system's usage information, records the used area as OOS, and synchronizes it.

FastSync is applied when bsradm primary --force is run for the initial synchronization, on invalidate and invalidate-remote, and on online verify. No special options are needed for FastSync to work.

Checksum-based synchronization

Summarizing data with checksums can further improve the efficiency of bsr's synchronization algorithm. Checksum-based synchronization reads a block before synchronizing it, computes a hash of its current contents on disk, and compares that hash with the hash of the same sectors on the other node. If the hashes match, the block is not rewritten; if they differ, the synchronization data is transmitted. This can perform better than simply overwriting every block to be synchronized. In addition, if the file system rewrote a sector with the same contents while the nodes were disconnected, resynchronization is skipped for that sector. Overall, synchronization time is shortened.
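
The idea of comparing per-block hashes before sending data can be illustrated with a short Python sketch. It is a simplified model and not bsr's implementation: SHA-256 and the 4096-byte block size are stand-ins chosen for the example, and block_digests and blocks_to_send are hypothetical helper names.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # illustrative block size, not a bsr setting


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Hash every block of a volume image."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def blocks_to_send(source: bytes, target_digests: List[bytes],
                   block_size: int = BLOCK_SIZE) -> List[int]:
    """Return indices of source blocks whose hash differs from the target's."""
    result = []
    for index, digest in enumerate(block_digests(source, block_size)):
        if index >= len(target_digests) or digest != target_digests[index]:
            result.append(index)  # hash mismatch: this block must be transferred
    return result


if __name__ == "__main__":
    source = b"a" * BLOCK_SIZE + b"b" * BLOCK_SIZE + b"c" * BLOCK_SIZE
    target = b"a" * BLOCK_SIZE + b"x" * BLOCK_SIZE + b"c" * BLOCK_SIZE
    print(blocks_to_send(source, block_digests(target)))
```

Only the digests cross the network for matching blocks, which is where the saving comes from when most blocks are already identical.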

Truck-based synchronization

Truck-based synchronization, in which the disks are physically transported and installed, is suitable in the following situations.

  • The amount of data to be synchronized initially is very large (hundreds of gigabytes or more).

  • The rate of change of the data to be copied is expected to be small compared to the total data size.

  • The available network bandwidth between the source and target sites is limited.

In these situations, initializing with the normal device synchronization instead of truck-based synchronization would take a very long time.

Consider the following situation. A local node is in the Primary role and disconnected. The device configuration is complete and identical copies of bsr.conf exist on both nodes. The commands for the initial promotion of the resource have been run on the local node, but the remote node is not yet connected.

  • Run the following command on the local node.

    bsradm new-current-uuid --clear-bitmap <resource>
  • Create a copy of the data to be replicated and of its metadata. For example, you could remove a hot-swappable drive from a RAID-1 mirror. In that case you need to add a new drive to the RAID set to continue mirroring, but the drive you removed is a verbatim copy that can be used elsewhere. If your local block device supports snapshot copies, you can use that function instead.

  • Run the following command on the local node. Note that this second invocation has no --clear-bitmap option.

    bsradm new-current-uuid <resource>
  • Physically transport the copy of the original data to the remote node and make it available there.

  • You can attach the transported disk directly, or copy the transported data bit for bit onto the existing disk. Do this for the metadata as well as the replicated data. If this is not possible, this method cannot be used.

  • Start the bsr resource on the remote node.

    bsradm up <resource>

When both nodes connect, they do not start a full device synchronization. Instead, only the blocks that have changed since the bsradm new-current-uuid --clear-bitmap command was run are synchronized automatically.

Even if nothing has changed, a brief synchronization may still occur to cover the areas in the Activity Log that are rolled back on the new secondary node.
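
Why only the changed blocks are transferred can be seen in a toy model of the bitmap. The Python sketch below is a simplified illustration under stated assumptions: ChangeBitmap is a hypothetical class, a freshly created resource is modeled as entirely out of sync, and clear() stands in for the effect of bsradm new-current-uuid --clear-bitmap. It does not reflect bsr's on-disk format.

```python
class ChangeBitmap:
    """Toy model of a bitmap that tracks out-of-sync blocks."""

    def __init__(self, n_blocks: int) -> None:
        # Assumption of this model: a new resource starts fully out of sync.
        self.dirty = set(range(n_blocks))

    def clear(self) -> None:
        """Model clearing the bitmap before the disk copy is taken."""
        self.dirty.clear()

    def record_write(self, block: int) -> None:
        """Model a write on the Primary while the peer is still disconnected."""
        self.dirty.add(block)

    def blocks_to_resync(self) -> list:
        """Blocks that need to be sent once the peer connects."""
        return sorted(self.dirty)


if __name__ == "__main__":
    bitmap = ChangeBitmap(1000)
    bitmap.clear()            # the copy is taken at this point
    bitmap.record_write(42)   # changes made while the disk is in transit
    bitmap.record_write(7)
    print(bitmap.blocks_to_resync())
```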

Bitmap clear synchronization

You can use the option that clears the bitmap (--clear-bitmap) to bring nodes into sync quickly, without a lengthy initial full synchronization. The following is an example of this operation.

Creating a new current UUID and clearing the bitmap UUID skips the initial sync. This use case only works with metadata that has just been created.

  1. On both nodes, initialize the meta and configure the device.

    bsradm -- --force create-md res
  2. Start the resource on both nodes so that they learn each other's volume size during the initial handshake.

    bsradm up res
  3. When both nodes are connected as Secondary/Secondary and Inconsistent/Inconsistent, create a new UUID and clear the bitmap.

    bsradm new-current-uuid --clear-bitmap res
  4. Both nodes are now Secondary/Secondary and UpToDate/UpToDate. Promote one side to Primary and create a file system.

    bsradm primary res
    mkfs -t fs-type $(bsradm sh-dev res)

One obvious side effect of this approach is that the replica is full of old garbage (unless you make the disks identical by other means), so online verification is expected to report unsynchronized blocks. Never use this method when the volume already contains data. It may appear to work at first, but after you switch over to the other node, the data that was already there turns out not to have been replicated, and the data is broken.

Adjust sync speed

While synchronization runs in the background, the data on the target is temporarily inconsistent. This inconsistent state should be as short as possible, so a sufficiently high synchronization speed is desirable. However, replication and synchronization share the same network bandwidth: the more bandwidth synchronization takes, the less is left for replication. Less replication bandwidth increases local I/O latency and therefore lowers the local I/O performance of the active server. Because either replication or synchronization can starve the other by taking too much bandwidth, bsr implements variable-rate synchronization, which adjusts the synchronization bandwidth to the replication load while guaranteeing as much bandwidth as possible for replication. bsr uses this as the default policy. The fixed-rate synchronization policy, in contrast, guarantees the synchronization bandwidth regardless of replication. It can degrade local I/O performance when used on a server in operation, so it is generally not recommended and should be used only in special situations.

Replication and synchronization

  • Replication applies disk changes caused by local I/O to the target in real time. It is performed in the context in which the I/O is written to the local disk, and therefore affects local I/O latency.

  • Synchronization brings the data on the target disk in line with the source disk for the out-of-sync areas of the volume. It processes the volume sequentially, from sector 0 to the last sector.

To make this difference clear, bsr always describes replication and synchronization separately.

It is pointless to set the synchronization speed higher than the maximum disk write speed of the standby node. The standby node is the target of the synchronization, so synchronization cannot run faster than its I/O subsystem can write. For the same reason, it makes no sense to set the synchronization speed higher than the bandwidth available on the replication network.

Fixed rate synchronization

The maximum bandwidth used for background resynchronization is determined by the resource's resync-rate option. It is set in the disk section of the resource configuration in /etc/bsr.conf, as follows:

resource <resource> {
  disk {
    resync-rate 40M;  
    c-min-rate 40M;  
    c-plan-ahead 0;  
    ...
  }
  ...
}

The resync-rate and c-min-rate values are rates in bytes per second. When no unit suffix is given, the unit is KiB, so a value of 4096 is interpreted as 4 MiB per second.

Important

  • If c-plan-ahead is set to a positive value, the synchronization speed is adjusted dynamically. The default is 20; set it to 0 for fixed-rate synchronization.

  • c-min-rate sets the minimum synchronization speed when replication and synchronization run at the same time. The default is 250k; to guarantee a fixed synchronization speed, set it to the same value as resync-rate.
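
To make the unit handling concrete, the following Python sketch converts rate values such as 40M, 250k, or a bare 4096 into KiB per second. It assumes that the K, M, and G suffixes are binary multiples and that a bare number is taken as KiB, as described above; rate_to_kib is a hypothetical helper and not part of bsr.

```python
# Multipliers relative to KiB (assumption: suffixes are binary multiples).
UNITS = {"K": 1, "M": 1024, "G": 1024 * 1024}


def rate_to_kib(value: str) -> int:
    """Convert a rate such as '40M' or '4096' to KiB per second."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)  # no suffix: the number is already in KiB


if __name__ == "__main__":
    for text in ("4096", "40M", "250k"):
        print(text, "->", rate_to_kib(text), "KiB/s")
```

With these assumptions, 4096 and 4M describe the same rate, which matches the 4 MiB interpretation given above.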

Variable rate synchronization

Fixed-rate synchronization is not optimal when multiple resources share one replication/synchronization network. If one resource's replication channel takes up the bandwidth at its fixed rate, the other resources cannot be guaranteed theirs. Variable-rate synchronization mitigates this by dynamically adjusting the synchronization rate of each replication channel. In this mode, bsr determines an initial synchronization speed and then adjusts it continuously through an automatic control loop. The algorithm ensures sufficient bandwidth for foreground replication and greatly reduces the impact of background synchronization on foreground I/O.

The optimal configuration for variable-rate synchronization depends on the available network bandwidth, the application I/O pattern, replication link congestion, and whether the replication accelerator (DRX) is used.

Synchronization time estimation

You can estimate the synchronization time with the following formula.

t_resync = D / R

  • t_resync is the estimated synchronization time.

  • D is the amount of data to be synchronized. You usually have little control over it, since it is the data that was modified while, for example, the network link was broken.

  • R is the tunable synchronization rate. Its upper limit depends on the replication network environment and the processing performance of the I/O subsystem.
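As a worked example of the formula, the sketch below estimates the time needed to resynchronize 100 GiB at a fixed rate of 40 MiB per second. The function name and the sample figures are illustrative assumptions, not values taken from bsr.

```python
def estimate_resync_seconds(data_bytes: int, rate_bytes_per_sec: int) -> float:
    """Estimated synchronization time: data to synchronize divided by the rate."""
    if rate_bytes_per_sec <= 0:
        raise ValueError("synchronization rate must be positive")
    return data_bytes / rate_bytes_per_sec


if __name__ == "__main__":
    d = 100 * 1024 ** 3   # D: 100 GiB out of sync (illustrative)
    r = 40 * 1024 ** 2    # R: 40 MiB per second (illustrative)
    seconds = estimate_resync_seconds(d, r)
    print(f"{seconds:.0f} s (about {seconds / 60:.1f} min)")  # 2560 s (about 42.7 min)
```

The estimate only holds if the rate stays constant, which is the case for fixed-rate synchronization but not for variable rate synchronization.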

Congestion mode

Used only in asynchronous replication.

In an environment where the replication bandwidth varies, such as WAN replication, the replication link can occasionally become congested. If the primary node's I/O has to wait on the congested link, local I/O performance degrades, which is undesirable. You can configure bsr to suspend replication when it detects congestion. While replication is suspended, the primary's data set runs ahead of the secondary's, and the blocks that are ahead are recorded as out-of-sync (OOS). When the congestion clears, the OOS blocks are resolved through resynchronization.

In the resource configuration file, the on-congestion option sets the congestion mode, and the congestion-fill option sets the threshold at which congestion is recognized. The following is an example of setting the congestion policy.

resource <resource> {
  net {
    sndbuf-size 20M;
    on-congestion pull-ahead;
    congestion-fill 2G;
    congestion-extents 2000;
    ...
  }
  ...
}

The pull-ahead option is used together with congestion-fill and congestion-extents. The recommended values are:

  • congestion-fill, when linked with the replication accelerator (DRX): about 90% of the DRX buffer size.

  • congestion-fill, when DRX is not linked: 90% of sndbuf-size.

  • congestion-extents: 90% of the al-extents setting.
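The 90% recommendations can be computed as in the sketch below. The helper ninety_percent is hypothetical, and the al-extents value of 6001 is an assumed figure used only for illustration. Integer arithmetic is used to avoid floating-point rounding.

```python
def ninety_percent(value: int) -> int:
    """Return 90% of value, rounded down (hypothetical helper)."""
    return value * 90 // 100


if __name__ == "__main__":
    sndbuf_size = 20 * 1024 ** 2   # sndbuf-size 20M, no DRX linked
    al_extents = 6001              # assumed al-extents value, for illustration only

    print("congestion-fill    :", ninety_percent(sndbuf_size), "bytes")
    print("congestion-extents :", ninety_percent(al_extents))
```

When DRX is linked, pass the DRX buffer size instead of sndbuf-size.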

Disk flush

If the target node suddenly goes down during replication, for example due to a power failure, data may be lost if the disk cache is not protected by a battery-backed write cache (BBWC). To prevent this, bsr always performs a flush after writing data to the target's disk so that the data reaches the media.

A storage device equipped with BBWC does not need the disk flush operation, so bsr provides options to disable flushing as follows.

resource <resource> {
  disk {
    disk-flushes no;
    md-flushes no;
    ...
  }
  ...
}

Disable device flushing only when running bsr on devices with a battery-backed write cache (BBWC). Most storage controllers automatically disable the write cache and switch to write-through mode when the battery is exhausted.

Consistency verification

Consistency verification either checks the integrity of replication traffic block by block in real time during replication, or compares hash digests block by block to verify that the source and target data match completely across the whole (used) disk volume.

Traffic integrity check

bsr can use cryptographic message digest algorithms to verify message integrity between both nodes. When this function is enabled, bsr generates a message digest of every data block and sends it to the peer node, which uses it to verify the integrity of the replication packet. If the digests do not match, the peer requests retransmission.
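The idea can be sketched as follows. This is a conceptual illustration only, not bsr's implementation or wire format: the function names are hypothetical, and SHA-256 from Python's standard library stands in for whichever algorithm is configured.

```python
import hashlib
from typing import Tuple


def make_packet(block: bytes) -> Tuple[bytes, bytes]:
    """Sender side: pair a data block with its digest (SHA-256 as a stand-in)."""
    return block, hashlib.sha256(block).digest()


def verify_packet(block: bytes, digest: bytes) -> bool:
    """Receiver side: recompute the digest and compare it with the one received."""
    return hashlib.sha256(block).digest() == digest


if __name__ == "__main__":
    block, digest = make_packet(b"\x00" * 4096)
    print("intact block accepted:", verify_packet(block, digest))

    # Simulate a single bit flip in transit.
    corrupted = bytes([block[0] ^ 0x01]) + block[1:]
    if not verify_packet(corrupted, digest):
        print("digest mismatch: request retransmission")
```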

This check protects the source data against the following error conditions during replication. Left unhandled, any of them can potentially corrupt the replicated data.

  • Bit errors (bit flips) that occur in data transferred between main memory and the network interface of the transmitting node.

    • If the TCP checksum offload function provided by many recent network cards is enabled, such hardware bit flips may go undetected by software.

  • Bit errors that occur in data being transferred from the network interface to the receiving node's main memory (the same applies to TCP checksum offloading).

  • Corruption caused by a bug or race condition in the network interface firmware or driver.

  • Bit flips or random corruption injected by a reassembling network component between the nodes (when not using a direct, back-to-back connection).

Replication traffic consistency checking is disabled by default. To enable it, add the following to the resource configuration in /etc/bsr.conf.

resource <resource> {
  net {
    data-integrity-alg <algorithm>;
  }
  ...
}

<algorithm> is a message digest algorithm supported by the kernel cryptography API in the system's kernel configuration. On Windows, only crc32c is supported.

After changing the resource configuration of both nodes identically, execute bsradm adjust <resource> on both nodes to apply the changes.

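Online consistency verification

Online consistency verification checks block-by-block data consistency between nodes while the device is in operation. Verification does not check blocks redundantly, uses network bandwidth efficiently, and by default checks only the areas in use by the file system.

In online verification, one node (the verification source) sequentially computes a cryptographic digest of every data block on the storage of a given resource and sends the digests to the peer node (the verification target), which digests the block at the same location and compares the two. If the digests do not match, the block is marked out-of-sync and becomes a target for later synchronization. Because only a small digest is transferred rather than the full block contents, network bandwidth is used efficiently.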
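The digest comparison used by online verification can be sketched as follows. This is a conceptual illustration under stated assumptions, not bsr's implementation: the function names are hypothetical, the 4096-byte block size is assumed, SHA-256 stands in for the configured algorithm, and both devices are modeled as in-memory byte strings of equal length.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_digests(device: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Verification source: digest every block in order."""
    return [
        hashlib.sha256(device[off:off + block_size]).digest()
        for off in range(0, len(device), block_size)
    ]


def find_out_of_sync(source_digests: List[bytes], target: bytes,
                     block_size: int = BLOCK_SIZE) -> List[int]:
    """Verification target: digest local blocks and list the indexes that differ."""
    target_digests = block_digests(target, block_size)
    return [
        index
        for index, (src, dst) in enumerate(zip(source_digests, target_digests))
        if src != dst
    ]


if __name__ == "__main__":
    source = bytes(4 * BLOCK_SIZE)
    target = bytearray(source)
    target[2 * BLOCK_SIZE + 10] ^= 0xFF  # corrupt one byte in block 2

    # Only the digests cross the network, not the block contents.
    print(find_out_of_sync(block_digests(source), bytes(target)))  # [2]
```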

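Because the resource is verified while in operation, replication performance may drop slightly when online verification and replication run at the same time. The advantage is that the service does not need to be stopped, and no system downtime occurs during verification or during the synchronization that follows.

Online verification is typically registered as a scheduled task in the OS and run periodically during hours when the operational I/O load is low.

Enabling

Online verification is disabled by default. To enable it, add the following to the resource configuration in bsr.conf.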

resource <resource> {
   net {
       verify-alg <algorithm>;
   }
   ...
}

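<algorithm> is a message hashing algorithm. On Windows, only crc32c is supported.

To enable online verification, change the resource configuration identically on both nodes, then execute bsradm adjust <resource> on both nodes to apply the changes.

Running online verification

After enabling online verification, you can run a check with the following command.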

bsradm verify <resource>

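When online verification runs, bsr identifies the out-of-sync blocks in <resource>, marks them, and records them. During this time, all applications using the device can continue to operate without restriction, and the resource role can also be changed.

The verify command changes the disk state to UpToDate and then performs verification. It is therefore advisable to run it on the replication source node, whose disk is UpToDate, after initial synchronization has completed. For example, running verify on a node whose disk is Inconsistent changes that disk state to UpToDate, which can cause operational problems, so take care.

If out-of-sync blocks are detected during verification, you can synchronize them with the following commands after verification completes. Synchronization runs from the Primary node to the Secondary node and does not proceed in the Secondary/Secondary state. To resolve the OOS found by online verification, the source node must therefore be promoted to Primary.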

bsradm disconnect <resource>
bsradm connect <resource>

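Automatic verification

If you need to verify consistency regularly, register the bsradm verify <resource> command with the task scheduler as follows.

First, create a script file with the following content at a location of your choice on one of the nodes.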

bsradm verify <resource>

To verify all resources, use the all keyword instead of <resource>.

The following example uses schtasks (the Windows command for scheduling tasks) to create a scheduled task. With this setting, online verification runs every Sunday at 00:42.

 schtasks /create /tn "drbd_verify" /tr "%wdrbd_path%\verify.bat" /sc WEEKLY /D sun /st 00:42