Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Info

About mount operation

There is a difference between Windows and Linux OS for mount behavior. In Linux, the mounting process to use the volume is required manually, but in the Windows environment, mounting of the volume is performed automatically at the shell level of the operating system, so no separate mount command is required. Therefore, Linux requires an additional mount operation to use the volume after promotion.

Demotion

The transition from Primary role to Secondary role is called demotion.

Code Block
bsradm secondary <resource>
Info

On Linux, unmounting of the volume is required before performing a demotion. In the Windows environment, there is no need for a separate unmounting process since the unmount command is performed internally.

Unmounting and demotion of resources entails switching the role to Secondary as the heaviest task among the command operations of bsr, and reflecting all data pending replication to the target side. This is the basic operation structure for matching data consistency between the replication source and the target. This operation ensures data consistency between the primary and secondary at the time of demoting. Therefore, in the process of unmounting and demoting, it is necessary to keep in mind a certain amount of latency that is required to reflect all pending data to the target.

...

manually fail-over

The procedure for manual transfer is as follows.

...

Stop all applications or services using the bsr device on the primary node, and demote the resource to secondary (after umounting the volume on Linux).

Code Block
bsradm secondary <resource>

...

Note

bsr defaults to FastSync, which synchronizes only the areas used in the file system. However, if the file system of the replication volume is already damaged for some reason, FastSync based on the damaged information of the file system becomes impossible. In preparation for this situation, bsr performs an integrity check (fsck) of the file system before performing the initial synchronization, and if the file system is broken, the initial synchronization fails.

In this case, you will need to manually recover the file system and try to initialize the resource again.

Demotion

The transition from Primary role to Secondary role is called demotion.

Code Block
bsradm secondary <resource>
Info

On Linux, unmounting of the volume is required before performing a demotion. In the Windows environment, there is no need for a separate unmounting process since the unmount command is performed internally.

Unmounting and demotion of resources entails switching the role to Secondary as the heaviest task among the command operations of bsr, and reflecting all data pending replication to the target side. This is the basic operation structure for matching data consistency between the replication source and the target. This operation ensures data consistency between the primary and secondary at the time of demoting. Therefore, in the process of unmounting and demoting, it is necessary to keep in mind a certain amount of latency that is required to reflect all pending data to the target.

Info

manually fail-over

The procedure for manual transfer is as follows.

  1. Stop all applications or services using the bsr device on the primary node, and demote the resource to secondary (after umounting the volume on Linux).

    Code Block
    bsradm 
primary
  1. secondary <resource>
  2. Execute the following command on the node you want to promote to primary. Restart the service (after mounting the volume on Linux case).

    Code Block
    bsradm primary <resource>

Resource down

You can stop the resource with the bsradm down command. down stops in the reverse order of the up process described above, and if the resource was in the promoted state, demotes first. In short, resource demotion, replication disconnection, volume detach, and resource release in the following order.

...

bsr provides various functions such as FastSync, checksum-based synchronization, truck-based synchronization, and bitmap clear synchronization for efficient synchronization.

Fast Synchronization

bsr has improved changed the existing full synchronization method that performs for the entire disk area to Fast Synchronization( FastSync), which synchronizes only the area used by the file system. bsr collects file system's usage area information for FastSync, records the usage area in OOS and performs synchronization.FastSync is applied at the time of For example, if you are only using 100MB on a 1TB disk, initial synchronization can be completed 10 times faster than the existing full synchronization (1TB) because only 100MB disk area is synchronized. FastSync operates at the following times.

  • Initial full synchronization (bsradm primary --force

...

  • )

  • Manual full synchronization (invalidate/invalidate-remote

...

Checksum-based synchronization

Checksum data summarization can further improve the efficiency of bsr's synchronization algorithm. Checksum-based synchronization reads blocks before synchronization, obtains a hash summary of the contents on the current disk, and then reads the same sector from the other node and compares it with the obtained hash summary. If the hash match, the re-write for the block is omitted, and if they do not match, synchronization data is transmitted. This method can be advantageous in performance compared to the existing method of simply overwriting the block to be synchronized, and if the file system writes the same contents to the sector again while disconnected (disconnected), resynchronization is omitted for that sector. Overall, it have a more shorten synchronization time.

Truck-based synchronization

Truck-based synchronization by directly importing and configuring disks is suitable for the following situations.

  • Initially, the amount of data to be synchronized is very large (hundreds of gigabytes or more)

  • When the rate of change of the data to be copied is expected to be small compared to the huge data size

  • When available network bandwidth between source and target sites is limited

In the above situation, if you do not synchronize by truck-based synchronization and initialize with the normal device synchronization method, it will take a very long time during synchronization.

Let's say one situation. There is a local node that has been disconnected from being in Primary. That is, the device configuration is complete and the same copy of bsr.conf exists on both nodes. Commands for initial resource promotion have been executed on the local node, but the remote node is not connected yet.

  • Run the following command on the local node.

    Code Block
    bsradm new-current-uuid --clear-bitmap <resource>
  • Create copies of the data to be replicated and the metadata of the data. For example, you could use a hot-swappable drive in the RAID-1 mirror. Of course, in this situation, the RAID set will need to be replaced with a new drive to continue mirroring. However, the disk drive you removed here is a literal copy that can be used elsewhere. If your local block device supports snapshot copy function, you can use it.

  • Run the following command on the local node. There is no --clear-bitmap option in the second command run.

    Code Block
    bsradm new-current-uuid <resource>
  • Configures the same copy of the original data to be physically taken directly for use on remote nodes.

  • You can directly connect the disk physically, or you can copy the imported data to the existing disk and use it in bit units. This process should be done not only on the mirroring data, but also on the metadata. If such a procedure cannot be accepted, this method cannot proceed.

  • Start the bsr resource on the remote node.

    Code Block
    bsradm up <resource>

When both nodes are connected, they will not initiate full device synchronization. Instead, only synchronization of blocks that have changed since the bsradm--clear-bitmap new-current-uuid command was invoked is automatically initiated.

If there is no change, there may be a simple synchronization depending on the area covered in the Activity Log rolled back from the new secondary node. 

Bitmap clear synchronization

You can use the option to clear the bitmap (--clear-bitmap) so that it can be quickly sync without an initial full synchronization over a long period of time. The following are examples of these operations.

It can be used to skip the initial sync by creating a new Current UUID and clearing the Bitmap UUID. This use case only works for the metadata just created.

  1. On both nodes, initialize the meta and configure the device. bsradm -- --force create-md res

  2. Start resources of both nodes and recognize each other's volume size at the time of initial handshake. bsradm up res

  3. When both nodes are connected as Secondary / Secondary, Inconsistent / Inconsistent, create a new UUID and clear the bitmap. bsradm new-current-uuid --clear-bitmap res

  4. Now both nodes are in Secondary / Secondary, UpToDate / UpToDate state, and promote one side to Primary to create a file system. bsradm primary res mkfs -t fs-type $(bsradm sh-dev res)

One obvious side effect of this approach is that the replica is full of old garbage (unless you make it the same using other methods), it is expected to find the number of unsynchronized blocks when online verification. This method should never be used in situations where the volume already has data. At first glance it may seem to work, but once you switch to another node, the data that was already there is not replicated, so the data is broken.

Adjust sync speed

When synchronization is in the background, the target data is temporarily inconsistent. This inconsistent state should be kept as short as possible, which is good in terms of consistency, so it is advantageous to have a sufficient synchronization speed. However, replication and synchronization share the same network band, and if the synchronization band is set high, relatively few replication bands can be provided. Lowering the replication bandwidth affects local I/O latency and consequently lowers local I/O performance of the active server. Because either side of replication or synchronization occupies a lot of bands unilaterally, it affects the operation of the other side relatively, so bsr implements variable-rate synchronization that adequately adjusts the synchronization band according to the replication situation while guaranteeing the replication band as much as possible. bsr use it as the default policy. Conversely, the fixed-rate synchronization policy is generally not recommended and should only be used in special situations, as it can lead to a decrease in local I/O performance when used during server operation in a way that ensures synchronization bands regardless of replication.

Info

Replication and synchronization

  • Replication is the action to reflect the I/O of the disk change occurring locally to the target in real time. replication is performed in the context where the incremental I/O is written to the local disk, thus affecting the local I/O latency.

  • Synchronization is the operation of matching the data on the source side disk with the data on the target side by out-of-sync area of the entire disk volume this is processed from 0 sector to last sector of volume sequentially.

To clearly differentiate these differences, bsr always describes replication and synchronization separately.

Info

It is pointless to set the synchronization speed to a value higher than the maximum disk write speed of the standby node. Since the standby node is the target of device synchronization in progress, the synchronization speed cannot be faster than the write speed of the I/O subsystem that the standby node allows. For the same reason, setting the sync speed to a value higher than the bandwidth available on the replication network makes no sense.

Fixed rate synchronization

The maximum bandwidth used for resynchronization in the background is determined by the resource's resync-rate option. These options are included in the disk section of the /etc/bsr.conf resource configuration as follows:

Code Block
resource <resource> {
  disk {
    resync-rate 40M;  
    c-min-rate 40M;  
    c-plan-ahead 0;  
    ...
  }
  ...
}

The resync-rate and c-min-rate settings are specified in bytes per second. The default unit is Kibibyte, and the value of 4096 is interpreted as 4 MiB.

Info

Important 

  • If the c-plan-ahead parameter is set to a positive value, the synchronization speed is dynamically adjusted. This value is set to 20 by default, but this value should be set to 0 for fixed rate synchronization speed.

  • c-min-rate is a parameter to set the minimum synchronization speed when replication and synchronization are performed simultaneously. This value is set to 250k by default, and if you want to guarantee a fixed synchronization speed, you should set it to the same value as resync-rate.

Variable rate synchronization

Fixed-rate synchronization is not an optimal method when multiple resources share a replication/synchronization network. Because they share the same network, if a synchronization rate is occupied for a specific replication resource channel, other resources are not guaranteed a fixed synchronization rate. In this case, you can mitigate that the synchronization rate is occupied by configuring to dynamically adjust the synchronization rate of each replication channel through variable rate synchronization. bsr determines the initial sync speed in this mode and then continuously adjusts the sync speed through an automatic control loop algorithm. This algorithm ensures sufficient bandwidth for foreground replication and greatly mitigates the impact of background synchronization on foreground I/O.

The optimal configuration for variable rate synchronization may vary depending on the available network bandwidth, application I/O pattern, and replication link congestion, and the optimal configuration setting may vary depending on whether replication accelerator(DRX) is used.

Info

Synchronization speed estimation

You can estimate the synchronization time with the following formula.

tresync = D/R

  • tresync is the estimated synchronization time.

  • D is the size of the data to be synchronized under the assumption that it is rarely affected (such as data being modified in the event of a broken network link).

  • R is the tunable synchronization rate, which has different limits depending on the replication network environment and the processing performance of the I/O subsystem.

Congestion mode

Info

Used only in asynchronous replication.

In an environment where the replication bandwidth is variable (WAN replication environment), the replication link can sometimes become congested. Because of this, if the primary node's I/O waits, the performance of the local I/O will be degraded, which is undesirable. When detecting this congestion, you can configure it to suspend replication. Instead, in the situation where replication is interrupted, the primary data set is ahead of the secondary data, and these advanced data blocks are recorded as out-of-sync (OOS). when congestion is released, after all these oos is resolved through resynchronization. The following is an example of setting the congestion policy.

In the resource configuration file, the on-congestion option item sets the congestion mode, and the congestion-fill item sets the recognition threshold for congestion.

Code Block
resource <resource> {
  net {
    sndbuf-size 20M;
    on-congestion pull-ahead;
    congestion-fill 2G;
    congestion-extents 2000;
    ...
  }
  ...
}

The pull-ahead option is used with congestion-fill and congestion-extents. The recommended values for congestion-fill are:

  • When linking a replication accelerator(DRX), set it to about 90% of the DRX buffer size.

  • If DRX is not linked, set to 90% of sndbuf-size.

  • The recommended value for congestion-extents is 90% of the al-extents setting.

Disk flush

If the target node suddenly goes down due to power failure during replication, data loss may occur if the disk cache area is not backed up by a battery backup device (BBWC). In order to prevent this in advance, in the process of writing data to the disk of the target, after data is written to the media, the flush operation is always performed to prevent data loss.

The storage device equipped with BBWC does not need to perform the disk flush operation, so it provides an option to disable the flush as follows.

...

  • )

  • Online Verify check (bsradm verify)

Note

Note!

FastSync must first obtain information about the file system usage from the disk space before performing initial synchronization, but if the file system is damaged (broken), information about the used area may be processed inaccurately. If FastSync is processed without recognizing this, it will result in inconsistent consistency between the source and target, so you must be very careful.

Therefore, in order to prepare for this situation, bsr first requests a file system integrity check (chkdsk or fsck) before performing initial synchronization through bsradm primary --force, and enables FastSync when there are no problems with the results.

Before performing bsr initial synchronization, the administrator needs to perform a file system integrity check to determine the health status of the clone disk in advance.

Info

To change to the old FullSync method

bsrcon /set_fast_sync 0

When you want to know the current initial synchronization method

bsrcon /get_fast_sync

Checksum-based synchronization

Checksum data summarization can further improve the efficiency of bsr's synchronization algorithm. Checksum-based synchronization reads blocks before synchronization, obtains a hash summary of the contents on the current disk, and then reads the same sector from the other node and compares it with the obtained hash summary. If the hash match, the re-write for the block is omitted, and if they do not match, synchronization data is transmitted. This method can be advantageous in performance compared to the existing method of simply overwriting the block to be synchronized, and if the file system writes the same contents to the sector again while disconnected (disconnected), resynchronization is omitted for that sector. Overall, it have a more shorten synchronization time.

Truck-based synchronization

Truck-based synchronization by directly importing and configuring disks is suitable for the following situations.

  • Initially, the amount of data to be synchronized is very large (hundreds of gigabytes or more)

  • When the rate of change of the data to be copied is expected to be small compared to the huge data size

  • When available network bandwidth between source and target sites is limited

In the above situation, if you do not synchronize by truck-based synchronization and initialize with the normal device synchronization method, it will take a very long time during synchronization.

Let's say one situation. There is a local node that has been disconnected from being in Primary. That is, the device configuration is complete and the same copy of bsr.conf exists on both nodes. Commands for initial resource promotion have been executed on the local node, but the remote node is not connected yet.

  • Run the following command on the local node.

    Code Block
    bsradm new-current-uuid --clear-bitmap <resource>
  • Create copies of the data to be replicated and the metadata of the data. For example, you could use a hot-swappable drive in the RAID-1 mirror. Of course, in this situation, the RAID set will need to be replaced with a new drive to continue mirroring. However, the disk drive you removed here is a literal copy that can be used elsewhere. If your local block device supports snapshot copy function, you can use it.

  • Run the following command on the local node. There is no --clear-bitmap option in the second command run.

    Code Block
    bsradm new-current-uuid <resource>
  • Configures the same copy of the original data to be physically taken directly for use on remote nodes.

  • You can directly connect the disk physically, or you can copy the imported data to the existing disk and use it in bit units. This process should be done not only on the mirroring data, but also on the metadata. If such a procedure cannot be accepted, this method cannot proceed.

  • Start the bsr resource on the remote node.

    Code Block
    bsradm up <resource>

When both nodes are connected, they will not initiate full device synchronization. Instead, only synchronization of blocks that have changed since the bsradm--clear-bitmap new-current-uuid command was invoked is automatically initiated.

If there is no change, there may be a simple synchronization depending on the area covered in the Activity Log rolled back from the new secondary node. 

Bitmap clear synchronization

You can use the option to clear the bitmap (--clear-bitmap) so that it can be quickly sync without an initial full synchronization over a long period of time. The following are examples of these operations.

It can be used to skip the initial sync by creating a new Current UUID and clearing the Bitmap UUID. This use case only works for the metadata just created.

  1. On both nodes, initialize the meta and configure the device. bsradm -- --force create-md res

  2. Start resources of both nodes and recognize each other's volume size at the time of initial handshake. bsradm up res

  3. When both nodes are connected as Secondary / Secondary, Inconsistent / Inconsistent, create a new UUID and clear the bitmap. bsradm new-current-uuid --clear-bitmap res

  4. Now both nodes are in Secondary / Secondary, UpToDate / UpToDate state, and promote one side to Primary to create a file system. bsradm primary res mkfs -t fs-type $(bsradm sh-dev res)

One obvious side effect of this approach is that the replica is full of old garbage (unless you make it the same using other methods), it is expected to find the number of unsynchronized blocks when online verification. This method should never be used in situations where the volume already has data. At first glance it may seem to work, but once you switch to another node, the data that was already there is not replicated, so the data is broken.

Adjusting the synchronization speed

When synchronization is running in the background, the data on the target is temporarily in an inconsistent state. This inconsistent state should be kept as short as possible to ensure consistency, so it is beneficial to have a high enough synchronization rate. However, replication and synchronization share the same network band, and if the synchronization band is set high, replication will be given relatively little bandwidth. If the replication band is lowered, it will affect local I/O latency and result in local I/O performance degradation. If either replication or synchronization unilaterally occupies a lot of bandwidth, it will affect the behavior of the other, so we implement variable-band synchronization, which guarantees the replication band as much as possible while moderating the synchronization band according to the replication situation, and this is the default policy. In contrast, the fixed-band synchronization policy, which guarantees the synchronization band at all times regardless of replication, can cause local I/O performance degradation if used during server operation, so it is not generally recommended and should be used only in special situations.

Info

Replication and synchronization

  • Replication is the action to reflect the I/O of the disk change occurring locally to the target in real time. replication is performed in the context where the incremental I/O is written to the local disk, thus affecting the local I/O latency.

  • Synchronization is the operation of matching the data on the source side disk with the data on the target side by out-of-sync area of the entire disk volume this is processed from 0 sector to last sector of volume sequentially.

To clearly differentiate these differences, bsr always describes replication and synchronization separately.

Info

It makes no sense to set the sync rate to a value higher than the maximum disk write speed of the standby node. Because the standby node is the target of ongoing device synchronization, the sync rate cannot be faster than the write speed of its I/O subsystem allows. For the same reason, it makes no sense to set the sync rate to a value higher than the bandwidth available on the replication network.

Fixed rate synchronization

The maximum bandwidth used for resynchronization in the background is determined by the resource's resync-rate option. These options are included in the disk section of the /etc/bsr.conf resource configuration as follows:

Code Block
resource <resource> {
  disk {
    resync-rate 40M;  
    c-min-rate 40M;  
    c-plan-ahead 0;  
    ...
  }
  ...
}

The resync-rate and c-min-rate settings are specified in bytes per second. The default unit is Kibibyte, and the value of 4096 is interpreted as 4 MiB.

Info

Important 

Dynamically adjusts the synchronization rate when the c-plan-ahead parameter is set to a positive value. This value is set to 20 by default and should be set to 0 for fixed-rate synchronization.

Variable rate synchronization

Configuring with fixed-bandwidth synchronization is problematic for configurations where multiple resources share a replication/synchronization network. Because the resources share the same replication network, if the synchronization rate is occupied for a particular replication resource channel, the other resources are not guaranteed a fixed synchronization rate. In this case, variable bandwidth synchronization can be configured to dynamically adjust the synchronization rate of each replication channel to proactively adjust the synchronization band in response to other resources taking over. Variable-band synchronization determines an initial synchronization rate (by resync-rate) and then uses an automatic control algorithm to continuously adjust the synchronization rate. This algorithm ensures that the synchronization band is available from c-min-rate to c-max-rate while still allowing replication to operate on the front end. Setting c-max-rate too high will affect the replication band, so it is preferable to set it to match the network band.

The optimal configuration for variable bandwidth synchronization depends on the available network bandwidth, application I/O patterns, and replication link congestion, and the optimal configuration settings may vary depending on whether you are using Replication Accelerator (DRX).

Info

c-min-rate guarantees a minimum synchronization rate of a specified size, regardless of whether you have a fixed-bandwidth or variable-bandwidth setting.

Info

Difference between BSR and DRBD when handling replication and synchronization at the same time

  • BSR tries to keep the sync band at a value from c-min-rate to c-max-rate when handling synchronization and replication simultaneously, meaning it tries to free up as much sync band as possible.

  • DRBD forces the synchronization band to drop to the value of c-min-rate when handling synchronization and replication at the same time.

Info

Synchronization speed estimation

You can estimate the synchronization time with the following formula.

tresync = D/R

  • tresync is the estimated synchronization time.

  • D is the size of the data to be synchronized under the assumption that it is rarely affected (such as data being modified in the event of a broken network link).

  • R is the tunable synchronization rate, which has different limits depending on the replication network environment and the processing performance of the I/O subsystem.

Set the synchronization ratio

You can also set the synchronization rate as a percentage of the replication bandwidth.

Code Block
resource <resource> {
  disk {
    c-min-rate 40M;  
    resync-ratio "3:1";
    ...
  }
  ...
}

You should disable device flushing only when running bsr on devices with battery backup write cache (BBWC). Most storage controllers automatically disable the write cache when the battery is exhausted and switch to write through mode when the battery is exhausted.

정합성 검증

정합성 검증은 복제를 수행하는 과정에서 복제 트래픽을 블럭 단위로 실시간 수행하거나 전체 디스크 볼륨 단위로 소스와 타깃의 데이터가 완전히 일치하는지 해쉬 요약을 기반으로 블럭단위로 비교하는 기능입니다.

트래픽 검사

bsr 은 암호화 메시지 요약 알고리즘을 사용하여 양 노드 간의 메시지 정합성을 검증할 수 있습니다. 이 기능을 사용하게 되면 bsr 은 모든 데이터 블록의 메시지 요약본을 생성하고 그것을 상대 노드에게 전달한 후 상대편 노드에서 복제 패킷의 정합성을 확인합니다. 만약 요약된 블럭이 서로 일치하지 않으면 재전송을 요청합니다.

bsr 은 데이터 복제 시 이러한 정합성 검사를 통해 다음과 같은 에러 상황들에 대해 소스 데이터를 보호할 수 있으며, 만약 이와 같은 상황들에 대응하지 못하면 잠재적으로 복제 중 데이터 손상이 야기될 수 있습니다.

  • 주 메모리와 전송 노드의 네트워크 인터페이스 사이에서 전달된 데이터에서 발생하는 비트 오류 (비트 플립).

    • 최근 랜카드가 제공하는 TCP 체크섬 오프로드 기능이 활성화 될 경우 하드웨어적인 비트플립이 소프트웨어 적으로 감지되지 않을 수 있습니다

  • 네트워크 인터페이스에서 수신 노드의 주 메모리로 전송되는 데이터에서 발생하는 비트 오류(동일한 사항이 TCP 체크섬 오프 로딩에 적용됩니다).

  • 네트워크 인터페이스 펌웨어 또는 드라이버 내에서의 버그 또는 경합상태로 인해 손상된 상태.

  • 노드간에 재조합 네트워크 구성 요소에 의해 주입 된 비트 플립 또는 임의의 손상 (직접 연결, 백투백 연결을 사용하지 않는 경우).

복제 트래픽의 정합성 검사는 기본적으로 비활성화되어 있습니다. 이를 활성화하려면 /etc/bsr.conf의 리소스 구성에 다음과 같은 내용을 추가합니다.

Code Block
resource <resource> {
  net {
    data-integrity-alg <algorithm>;
  }
  ...
}

<algorithm>은 시스템의 커널 구성에서 커널 암호화 API가 지원하는 메시지 해싱 압축 알고리즘입니다. Windows 에서는 crc32c 만 지원합니다.

양 노드의 리소스 구성을 똑같이 변경한 후, 양 노드에서 bsradm adjust <resource>를 실행하여 변경사항을 적용시킵니다. 

온라인 정합성 검사

온라인 정합성 검사는 장치 운영 중에 노드 간의 블록별 데이터의 정합성을 확인하는 기능입니다. 정합성 검사는 네트워크 대역폭을 효율적으로 사용하고 중복된 검사를 하지 않습니다.

온라인 정합성 검사는 한 쪽 노드에서(verification source) 특정 리소스 스토리지상의 모든 데이터 블럭을 순차적으로 암호화 요약(cryptographic digest)시키고, 요약된 내용을 상대 노드(verification target)로 전송하여 같은 블럭위치의 내용을 요약 비교 합니다. 만약 요약된 내용이 일치하지 않으면, 해당 블럭은 out-of-sync로 표시되고 나중에 동기화대상이 됩니다. 여기서 블럭의 전체 내용을 전송하는 것이 아니라 최소한의 요약본만 전송하기 때문에 네트워크 대역을 효과적으로 사용하게 됩니다.

리소스의 정합성을 검증하는 작업은 운영 중에 검사하기 때문에 온라인 검사와 복제가 동시에 수행될 경우 약간의 복제성능 저하가 있을 수 있습니다. 하지만 서비스를 중단할 필요가 없고 검사를 하거나 검사 후 동기화 과정 중에서 시스템의 다운 타임이 발생하지 않는 장점이 있습니다. 

보통 온라인 정합성 검사에 따른 작업은 OS에서 예약된 작업으로 등록하여 운영 I/O 부하가 적은 시간 대에 주기적으로 수행하는 것이 일반적인 사용법입니다.

활성화

온라인 정합성 검사는 기본적으로 비활성화되어 있는데, bsr.conf 내의 리소스 구성에 다음과 같은 내용을 추가하면 활성화할 수 있습니다.

Code Block
resource <resource> {
   net {
       verify-alg <algorithm>;
   }
   ...
}

algorithm 은 메시지 해싱 알고리즘을 말하며 Windows 에선 crc32c 만 지원합니다.

온라인 검증을 활성화 하기 위해 양 노드의 리소스 구성을 똑같이 변경한 후, 양 노드에서 bsradm adjust <resource>를 실행하여 변경사항을 적용시킵니다.

온라인 정합성 검사 실행

온라인 정합성 검사를 활성화한 후, 다음 명령을 사용하여 검사를 실행할 수 있습니다.

Info

drbdadm verify <resource>

온라인 검사가 실행되면, bsr 은 <resource>에서 동기화되지 않은 블록을 알아내 표시하고 이를 기록합니다. 이때 디바이스를 사용하는 모든 응용 프로그램은 아무런 제약 없이 동작할 수 있으며, 리소스의 역할 변경도 가능합니다.

verify 명령은 디스크 상태를 UpToDate로 변경한 후 검증을 수행합니다. 따라서 초기싱크가 완료된 이후 UpToDate 인 복제 소스 노드 측에서 수행하는 것이 바람직 합니다. 예를 들어, Inconsistent 상태의 디스크 노드 측에서 verify를 수행하면 디스크 상태가 UpToDate로 변경 되어 운영 상 문제가 될 수 있으므로 주의가 필요합니다.

검증이 실행되는 동안 out-of-sync 블록이 감지되면, 검증이 완료된 후에 다음 명령으로 동기화할 수 있습니다. 이 때 동기화가 되는 방향은 Primary 노드에서 Secondary 방향으로 이루어지며 Secondary/Secondary 상태에서는 동기화를 진행하지 않습니다. 따라서 Online 검증에 따른 OOS를 해소하기 위해선 소스 측 노드에 대한 Primary로의 승격이 요구됩니다. 

Code Block
drbdadm disconnect <resource>
drbdadm connect <resource>

자동 검사

정기적으로 정합성 검사를 할 필요가 있다면, 다음과 같은 방법으로 bsradm verify <resource> 명령을 작업 스케줄러에 등록합니다.

우선 노드 중 하나에서 특정 위치에 다음과 같은 내용의 스크립트 파일을 만듭니다. 

Info

drbdadm verify <resource>

모든 리소스를 검증하려면 <resource> 대신 all 키워드를 사용하면 됩니다. 

다음은 schtasks(windows 스케줄 설정 명령어)를 사용해 예약된 작업을 생성하는 예 입니다. 다음과 같이 설정 하면 매주 일요일 자정 42분에 온라인 정합성 검사를 수행하게 됩니다.

...

The example above sets the synchronization band to a ratio of 3 replication to 1 synchronization (4 total). However, the sync ratio is compared to the c-min-rate and if the c-min-rate is higher, it is applied as the c-min-rate value. This ensures that you have the minimum amount of synchronization bandwidth.

Congestion mode

Info

Used only in asynchronous replication.

In environments where the replication band is a variable band (WAN), the replication link can sometimes become congested. This causes the primary node's I/O to wait, resulting in a performance degradation of local I/O. Congestion mode is a configuration to respond to this situation.

When congestion is detected, replication is suspended and buffered data is slowly sent to the target while logging local I/O to OOS. During this process, the primary is in an Ahead data state compared to the secondary, and once it finishes sending buffered data, it automatically enters sync mode to synchronize the OOS areas that failed to replicate.

Here is an example of setting up a congestion policy

In the resource configuration file, set the congestion mode with the on-congestion option item and set the congestion detection threshold with the congestion-fill item.

Code Block
resource <resource> {
  net {
    sndbuf-size 1G;
    on-congestion pull-ahead;
    congestion-fill 900M;
    congestion-extents 5500;
    congestion-highwater 20000;
    ...
  }
  ...
}

The pull-ahead option is used with congestion-fill, congestion-extents, or congestion-highwater. The recommended values for each property are as follows

  • Set congestion-fill to approximately 90% of the size of sndbuf-size. If you are integrating a replication accelerator (DRX), set it to 90% of the DRX buffer. However, if the buffer is allocated a large size, say 10GB or more, the 90% threshold may be too large to detect congestion, so this should be adjusted to a reasonable value through tuning.

  • The recommended value for congestion-extents is 90% of the al-extents setting.

  • congestion-highwater detects congestion based on packet count. It is particularly appropriate for use in DR environments where capacity-based detection of replication congestion is not suitable. It is set to 20000 by default and is enabled by default. It is disabled when set to 0 and has a maximum value of 1000000.

Info

Transmission buffer (sndbuf) and DRX buffer

It is difficult to allocate a large amount of the transmission buffer (sndbuf) set in bsr because it is allocated directly from kernel memory. This will vary depending on your system, but you will usually need to limit the size to within 1GB. Otherwise, if system kernel memory becomes insufficient due to transmission buffer allocation, system operation and performance may be affected.

Therefore, if you need to configure a large buffer, it is recommended to configure it as a DRX buffer.

Disk flush

If the target node suddenly goes down due to power failure during replication, data loss may occur if the disk cache area is not backed up by a battery backup device (BBWC). In order to prevent this in advance, in the process of writing data to the disk of the target, after data is written to the media, the flush operation is always performed to prevent data loss.

The storage device equipped with BBWC does not need to perform the disk flush operation, so it provides an option to disable the flush as follows.

Code Block
resource <resource>
  disk {
    disk-flushes no;
    md-flushes no;
    ...
  }
  ...
}

You should disable device flushing only when running bsr on devices with battery backup write cache (BBWC). Most storage controllers automatically disable the write cache when the battery is exhausted and switch to write through mode when the battery is exhausted.

Consistency verification

Consistency verification is a function that performs replication traffic in real-time in block units during replication or compares block-by-block based on hash summaries to verify that the source and target data are completely matched in whole (used) disk volume units.

Traffic integrity check

bsr can use cryptographic message digest algorithms to verify message integrity between both nodes. When this function is used, bsr generates a message summary of all data blocks, delivers it to the other node, and verifies the integrity of the replication packet at the other node. If the summarized blocks do not match each other, retransmission is requested.

When replicating data, bsr can protect the source data against the following error conditions through this consistency check, and failure to respond to such situations can potentially cause data corruption during replication.

  • Bit errors (bit flips) that occur in data transferred between main memory and the network interface of the transmitting node.

    • If the TCP checksum offload function provided by LAN Card is recently activated, hardware bitflip may not be detected by software.

  • Bit errors that occur on data being transferred from the network interface to the receiving node's main memory (the same applies for TCP checksum offloading).

  • Damage due to a bug or race condition within the network interface firmware or driver.

  • Bit flips or random damage injected by the recombination network component between nodes (if not using direct connection, back-to-back connection).

Replication traffic consistency checking is disabled by default. To enable it, add the following to the resource configuration in /etc/bsr.conf.

Code Block
resource <resource> {
  net {
    data-integrity-alg <algorithm>;
  }
  ...
}

<algorithm> is a message hashing compression algorithm supported by the kernel cryptography API in the system's kernel configuration. On Windows, only crc32c is supported.

After changing the resource configuration of both nodes identically, execute bsradm adjust <resource> on both nodes to apply the changes.

Online Verification

Online Verification is a function to check the consistency of block-specific data between nodes during service is online . it does not duplicate check, and it is basically used to efficiently use network bandwidth and check the area used by the file system.

The online verification sequentially encrypts all data blocks on a specific resource storage at one node (verification source), and then sends the summarized content to a verification target to summarize the contents of the same block location and compare it. If the summarized content does not match, the block is marked out-of-sync and is later synchronized. Here, network bandwidth is effectively used because only the smallest summary is transmitted, not the entire contents of the block.

Since the operation to verify the consistency of the resource is checked during operation, there may be a slight decrease in replication performance when online verification and replication are performed simultaneously. However, there is an advantage that there is no need to stop the service, and there is no downtime of the system during the scan or synchronization process after the scan.

Generally, it is common practice to perform tasks according to online verification as scheduled tasks in the OS and perform them periodically during periods of low operational I/O load.

Enable

Online verification is disabled by default, but can be activated by adding the following entry to the resource configuration in bsr.conf.

Code Block
resource <resource> {
   net {
       verify-alg <algorithm>;
   }
   ...
}

algorithm means the message hashing algorithm, and only supports crc32c in Windows.

To enable online verification, make the same resource configuration changes on both nodes, then run bsradm adjust <resource> on both nodes to apply the changes.

OV run

After enabling online verification, you can run the test using the following command:

Info

drbdadm verify <resource>

When an online verification is executed, bsr finds and displays the unsynchronized block in <resource> and records it. At this time, all applications that use the device can operate without any restrictions, and the role of the resource can also be changed.

The verify command performs a verification after changing the disk status to UpToDate. Therefore, it is desirable to perform UpToDate on the replication source node side after the initial sync is completed. For example, if you perform verification on the disk node side of the Inconsistent state, the disk state is changed to UpToDate, which may cause operational problems.

If an out-of-sync block is detected while verification is running, after verification is complete, you can synchronize with the next command. At this time, the direction of synchronization is from the primary node to the secondary direction, and synchronization is not performed in the secondary/secondary state. Therefore, in order to solve the OOS due to online verification, promotion to the primary on the source side node is required. 

Code Block
drbdadm disconnect <resource>
drbdadm connect <resource>

Automatic verification

If you need to do a regularity check, register the bsradm verify <resource> command to the task scheduler in the following way.

First, create a script file with the following contents in a specific location on one of the nodes.

Info

drbdadm verify <resource>

To verify all resources, you can use the all keyword instead of <resource>.

The following is an example of creating a scheduled task using schtasks (windows schedule setting command). With the following settings, online verification is performed every Sunday at 00:42 AM.

Code Block
 schtasks /create /tn "drbd_verify" /tr "%wdrbd_path%\verify.bat" /sc WEEKLY /D sun /st 00:42

Persist Role

While resource roles can be changed based on operational circumstances, sometimes you may want to persist roles. (BSR 1.7.3 and later)
A resource with persist-role set will continue to have the resource role explicitly specified (with the bsradm command) at the time of restart. This works in any situation where the replication service or system reboots, causing the resource to restart.

Code Block
resource <resource> {
  options {
    persist-role yes;
  }
  ...
}

One-way replication

If you always want to have only one-way replication from the primary node to the standby node, without swtichover or failover, consider the target-only attribute on the standby node side. (BSR 1.7.3 and later)

  • Set the persist-role attribute described above in the resource options section to fix the roles of the primary and standby nodes.

  • Set the target-only attribute on the standby node side to force the replication/synchronization direction from the primary node to the standby node only.

A target-only node is prohibited from acting as a source in all replication/sync operations, including explicit commands, and can only have a target role; any manual synchronization or promotion commands that act as a source are blocked (but promotion is allowed on disconnection).

Code Block
resource <resource> {
  options {
    persist-role yes;
  }
  
  on active {
    ...
  }
  
  on standby-DR {
    ...
    options {
      target-only yes;
      ...
    }
  }
  ...
}
Info

Verify data on a target-only node
After disconnecting replication, you can verify data by promoting it. At the time of promotion to verify data, SB has occurred, so to return to replication, demote again and process as SB resolution.