Info

About mount operation

Mount behavior differs between Windows and Linux. On Linux, the volume must be mounted manually before it can be used. On Windows, the operating system mounts the volume automatically, so no separate mount command is needed. Linux therefore requires an additional mount operation after promotion before the volume can be used.

Note

bsr defaults to FastSync, which synchronizes only the areas used by the file system. However, if the file system of the replication volume is already damaged, FastSync cannot work because it would rely on damaged file system information. To guard against this, bsr runs an integrity check (fsck) on the file system before the initial synchronization, and the initial synchronization fails if the file system is broken.

In this case, you need to recover the file system manually and then initialize the resource again.

Demotion

The transition from Primary role to Secondary role is called demotion.

Code Block
bsradm secondary <resource>

Info
On Linux, the volume must be unmounted before demotion. On Windows, no separate unmount step is needed because the unmount is performed internally.

Unmounting and demoting a resource is the heaviest of the bsr command operations: it switches the role to Secondary and applies all data still pending replication to the target. This is the basic mechanism that keeps the replication source and target consistent, and it ensures that the Primary and Secondary hold consistent data at the time of demotion. Expect some latency during unmounting and demotion while all pending data is applied to the target.

Info

Manual failover

The procedure for a manual failover is as follows.

Stop all applications or services that use the bsr device on the primary node, and demote the resource to Secondary (on Linux, unmount the volume first).
Code Block
bsradm secondary <resource>
On the node you want to promote to Primary, execute the following command, then restart the services (on Linux, mount the volume first).
Code Block
bsradm primary <resource>
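
The ordering of the manual failover steps can be sketched as a small Python helper that only builds the command lists and runs nothing. This is an illustration under stated assumptions: the function name, the device path, and the mount point are hypothetical, and stopping or restarting services is left out because it depends on the application.

```python
from typing import List, Optional, Tuple


def manual_failover_plan(
    resource: str,
    linux: bool,
    device: Optional[str] = None,
    mount_point: Optional[str] = None,
) -> Tuple[List[str], List[str]]:
    """Return (commands for the old Primary, commands for the new Primary).

    Services must be stopped before the first list is run and restarted
    after the second list has finished; that part is application-specific.
    """
    if linux and (device is None or mount_point is None):
        raise ValueError("device and mount_point are required on Linux")

    old_primary: List[str] = []
    new_primary: List[str] = []

    # Old Primary: on Linux the volume is unmounted before demotion.
    if linux:
        old_primary.append(f"umount {mount_point}")
    old_primary.append(f"bsradm secondary {resource}")

    # New Primary: promote first, then mount the volume on Linux.
    new_primary.append(f"bsradm primary {resource}")
    if linux:
        new_primary.append(f"mount {device} {mount_point}")

    return old_primary, new_primary


# Hypothetical device path and mount point, for illustration only.
old_cmds, new_cmds = manual_failover_plan(
    "r0", linux=True, device="/dev/bsr0", mount_point="/mnt/r0"
)
```

On Windows the same helper returns only the bsradm commands, because mounting and unmounting are handled by the operating system.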

Resource down

You can stop a resource with the bsradm down command. down reverses the order of the up process described above: it demotes the resource if it is promoted, disconnects replication, detaches the volume, and releases the resource, in that order.

Info
On Linux, the volume must be unmounted first.

Code Block
bsradm down <resource>

demotion

If the resource was promoted, demote it first.

disconnect

Replication is stopped by closing the connection. You can also disconnect on its own with the disconnect command.

If you disconnect while synchronization or replication is in progress, the disconnection may stall for a while. The disconnect command is delivered to both the local and peer nodes, so when a large amount of data is already buffered for replication, delivery of the command can be delayed because requests are processed sequentially. To avoid this delay, you can force a local disconnect with the --force option. A forced disconnect completes quickly, but all data pending replication or synchronization is discarded, so out-of-sync (OOS) data can result.

...

Frees all memory and threads allocated for the resource.

Reconfigurations

bsr generally supports changing resource properties while the resource is in operation (at runtime). This is called a dynamic setting (change). However, some essential properties do not support dynamic changes and must be reconfigured statically: change the setting in the configuration file, then restart the resource to apply it. In other words, a static setting requires a resource restart.

...

Change the configuration file and apply the changes at runtime with the bsradm adjust command. Most properties can be changed this way, except for a few special settings such as the replication protocol.

Info

Change replication protocol

To change the replication protocol during operation, the protocol, transmission buffer, and congestion control settings must be changed together.

First, delete the peer connection with the bsrsetup del-peer <resource> <node-id> command.
Adjust the protocol, congestion control settings, and sndbuf-size in the resource files on both nodes.
Apply the changes with bsradm adjust <resource>.

Static settings

If you need to change essential settings of the replication configuration (node ID, volume information, and so on), bring the resource down first and then change the settings. After changing the configuration file, bring the resource up again; the changed settings take effect when the resource restarts.

Full reconfigurations

If you need to change the configuration completely or recover from a disk failure, you must reconfigure the entire resource. In this case, first bring the running resource down, then change the configuration, reinitialize the metadata, and restart the resource.

Info
On Windows, you may need to release the lock on the volume during a full reconfiguration. In that case, you can release the volume lock with the /m option of the bsrcon utility.

Initializing the meta disk requires you to repeat the initial synchronization of the volume.

Resizing volume

A volume of a configured resource may need to be grown or shrunk depending on operational needs. Resizing a replication volume requires a separate procedure that differs by platform. Only online growing is supported; shrinking a volume must follow the full reconfiguration procedure.

Windows

To grow the volumes on both nodes during replication on Windows, first disconnect replication and bring both nodes to Primary. In the Secondary state the volume cannot be resized because bsr keeps it locked. Because both nodes are promoted to Primary, the replication cluster enters a split-brain state. After resizing the volumes, demote the node that was originally Secondary and resolve the split-brain with that node as the victim.

This grows the whole volume, and the newly added area is synchronized from the source, which makes online volume growth possible. The grown target volume must be at least as large as the source.

Info
As the volume grows, the meta disk grows automatically as well (bsr handles this internally when the volume is expanded). If the meta disk does not have enough free space, the volume expansion fails. To allow online volume growth, size the meta disk with this in mind during the initial resource configuration.

Linux

To grow a volume online on Linux, the following conditions must be met:

bsr's block device must be configured with a volume manager such as LVM.
The source and target nodes must keep their replication connection in the Connected state.

With a node in the Primary state, grow the volume on both nodes through LVM, and then issue the following command on one node so that bsr recognizes the new size.

Code Block
bsradm resize <resource>

A new resynchronization then starts for the newly added area of the volume.

Delete resource

A resource is deleted by deleting its configuration file. If the resource is in operation, delete it with the following procedure.

Bring the running resource down.
- On Windows, release the lock on the volume with bsrcon /m.
Delete the resource configuration file.

Inquiry

Version

Check the bsr version information with the bsradm -V command.

Code Block

[root@bsr-01 nglee]# bsradm -V
BSRADM_BUILDTAG=GIT-hash:3dca67e82d331e95121288a57898fcda13357e94 build by nglee@NGLEE-1,2020-01-29 13:50:48
BSRADM_API_VERSION=2
BSR_KERNEL_VERSION_CODE=0x000000
BSR_KERNEL_VERSION=0.0.0
BSRADM_VERSION_CODE=0x010600
BSRADM_VERSION=1.6.0-PREALPHA3

Status Information

Print out basic status information.

Code Block
>bsradm status r0
r0 role:Secondary
  disk:UpToDate
  nina role:Secondary
    disk:UpToDate
  nino role:Secondary
    disk:UpToDate
  nono connection:Connecting

Print detailed information.

info

Code Block

C:\>bsrsetup status r0 --verbose --statistic
r0 node-id:0 role:Secondary suspended:no
    write-ordering:flush
  volume:0 minor:2 disk:Inconsistent
      size:4096000 read:0 written:0 al-writes:0 bm-writes:0 upper-pending:0
      lower-pending:0 al-suspended:no blocked:no
  WIN2012R2_2 node-id:1 connection:Connected role:Secondary congested:no
    volume:0 replication:Established peer-disk:Inconsistent
        resync-suspended:no
        received:0 sent:0 out-of-sync:0 pending:0 unacked:0

Performance indicator

sent (network send). The amount of network data sent to the other node over the network connection. (KiB)
received (network receive). The amount of network data received from the other node over the network connection. (KiB)
written (disk write). Net data written to the local disk. (KiB)
read (disk read). Net data read from the local disk. (KiB)
al-writes (activity log). The number of updates to the activity log area of the metadata.
bm-writes (bitmap). The number of updates to the bitmap area of the metadata.
upper-pending (application pending I/O). The number of I/O requests passed to bsr from the upper layer that bsr is still processing and has not completed.
lower-pending (subsystem open count). The number of open (not yet closed) requests that bsr has issued to the local I/O subsystem.
pending. The number of requests sent from the local node to the other node that have not been acknowledged.
unacked (unacknowledged). The number of requests received by the other node that have not been acknowledged.
write-ordering (write order). Indicates the current disk write method.
out-of-sync. The amount of storage that is currently out of sync. (KiB)
resync-suspended. Whether resynchronization is suspended. Possible values are no, user, peer, and dependency.
blocked. Local I/O congestion status.
- no: No congestion
- upper: Congestion on the upper device
- lower: Congestion on the lower disk
congested. Indicates whether the TCP send buffer of the replication connection is more than 80% full.
- yes: Congested
- no: Not congested
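
For monitoring scripts, the key:value fields of a status line can be pulled into a dictionary. The sketch below assumes the whitespace-separated key:value layout shown in the sample output above; the function name is hypothetical, and it is meant for output lines only, not for the command prompt line.

```python
from typing import Dict


def parse_status_fields(line: str) -> Dict[str, str]:
    """Split one status output line into its key:value fields."""
    fields: Dict[str, str] = {}
    for token in line.split():
        key, sep, value = token.partition(":")
        if sep:  # bare tokens such as a resource or host name are skipped
            fields[key] = value
    return fields


# One line copied from the sample verbose output.
counters = parse_status_fields("received:0 sent:0 out-of-sync:0 pending:0 unacked:0")

# out-of-sync is reported in KiB; a non-zero value means data still needs resync.
needs_resync = int(counters["out-of-sync"]) > 0
```

Values are kept as strings because some fields (for example role or congested) are not numeric; convert individual counters with int() where needed.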

Print the network connection status.

Code Block
C:\>bsradm cstate r0
Connected
Info

Connection status

StandAlone. No network configuration is available because the resource has not yet been connected, the user has disconnected it with bsradm disconnect, or the connection was dropped for a reason such as authentication failure or split-brain.
Disconnecting. A temporary state while the connection is being closed. Next status: StandAlone
Unconnected. A temporary state before a connection attempt. Next status: Connecting or Connected
Timeout. A temporary state following a communication timeout with the other node. Next status: Unconnected
BrokenPipe. A temporary state shown after the connection with the other node is lost. Next status: Unconnected
NetworkFailure. A temporary state shown after the connection with the other node is lost. Next status: Unconnected
ProtocolError. A temporary state shown after the connection with the other node is lost. Next status: Unconnected
TearDown. A temporary state indicating that the other node is closing the connection. Next status: Unconnected
Connecting. The node is waiting for the peer node to become visible on the network.
Connected. The TCP connection is established, and the node is waiting for the first network packet from the other node.

Replication status

Off. The connection with the other node is lost, or replication is not in progress.
Established. The normal connected state. The connection is established and data mirroring is active.
StartingSyncS. The local node is the source, and a full synchronization has been started by the user. Next status: SyncSource or PausedSyncS
StartingSyncT. The local node is the target, and a full synchronization has been started by the user. Next status: WFSyncUUID
WFBitMapS. Partial synchronization is starting. Next status: SyncSource or PausedSyncS
WFBitMapT. Partial synchronization is starting. Next status: WFSyncUUID
WFSyncUUID. Synchronization is about to begin. Next status: SyncTarget or PausedSyncT
SyncSource. The local node is the source, and synchronization is in progress.
SyncTarget. The local node is the target, and synchronization is in progress.
VerifyS. The local node is the source, and online device verification is running.
VerifyT. The local node is the target, and online device verification is running.
PausedSyncS. The local node is the source, and synchronization is paused, either because it depends on another synchronization completing or because of a manual command (bsradm pause-sync).
PausedSyncT. The local node is the target, and synchronization is paused, either because it depends on another synchronization completing or because of a manual command (bsradm pause-sync).
Ahead. The local node has reached network congestion and cannot send replication data. (It sends OOS information to the peer node.)
Behind. Replication data cannot be received because the peer node has reached network congestion. (The node then switches to the SyncTarget state.)

Connection status and replication status are reported separately. Until both nodes are connected, the connection status moves between StandAlone and Connecting. Once the connection is established, the connection status stays Connected, and the replication status starts at Established and changes to other states depending on the operating situation.

A replication state has only one value at any given time. In particular, if a node is in a source state, its peer node must be in the corresponding target state.
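
The "Next status" notes in the state lists above can be collected into a lookup table, which is convenient when a monitoring script has to decide whether a state is expected to resolve on its own. This is only a transcription of the transitions listed in this document, not a complete state machine; the names NEXT_STATES and documented_next_states are hypothetical.

```python
from typing import Dict, Tuple

# Transitions exactly as listed in the connection and replication state lists.
NEXT_STATES: Dict[str, Tuple[str, ...]] = {
    # Connection states
    "Disconnecting": ("StandAlone",),
    "Unconnected": ("Connecting", "Connected"),
    "Timeout": ("Unconnected",),
    "BrokenPipe": ("Unconnected",),
    "NetworkFailure": ("Unconnected",),
    "ProtocolError": ("Unconnected",),
    "TearDown": ("Unconnected",),
    # Replication states
    "StartingSyncS": ("SyncSource", "PausedSyncS"),
    "StartingSyncT": ("WFSyncUUID",),
    "WFBitMapS": ("SyncSource", "PausedSyncS"),
    "WFBitMapT": ("WFSyncUUID",),
    "WFSyncUUID": ("SyncTarget", "PausedSyncT"),
    "Behind": ("SyncTarget",),
}


def documented_next_states(state: str) -> Tuple[str, ...]:
    """Return the listed follow-up states, or an empty tuple if none is listed."""
    return NEXT_STATES.get(state, ())
```

States such as Connected or Established return an empty tuple because the lists give no fixed follow-up state for them.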

The following command shows the role of a resource.

Code Block
C:\Program Files\bsr>bsradm role r0
Primary/Secondary

Resources have one of the following roles:

Primary. The resource can be read and written. Only one node within a cluster can have this role.
Secondary. The resource receives disk updates from the Primary node and can be neither read nor written. One or more nodes can have this role.
Unknown. The role of the resource is unknown. It is used to show the role of the peer node in disconnected mode and is never used for the role of the local node.

The following command shows the disk status.

Code Block
C:\Program Files\bsr>bsradm dstate r0
UpToDate/UpToDate

Local and remote disks have one of the following states:

Diskless. No local block device is assigned to the bsr driver. A resource is in this state when it has never been attached to its backing device, when it has been detached manually with the bsradm detach <resource> command, or when it has been detached automatically after a lower-level I/O error.
Attaching. A transient state while the metadata is being read.
Failed. A transient state following an I/O failure report from the local block device. The next state is Diskless.
Negotiating. A transient state entered when attach is executed on a device that is already connected.
Inconsistent. The data is inconsistent. When a new resource is configured, the disks on both nodes are in this state. It is also the disk state of a target node during synchronization.
Outdated. The resource data is consistent but not up to date.
DUnknown. Used for the remote disk state when no network connection is available.
Consistent. A transient state during connection in which the data is considered consistent. Once the connection is complete, the state is resolved to either UpToDate or Outdated.
UpToDate. The data is consistent and up to date. This is the normal state during replication.

Info

bsr distinguishes between Inconsistent data and Outdated data. Inconsistent data is data that cannot be accessed or used in any way. A typical example is the data on the target side while synchronization is in progress: part of it is up to date and part of it is from an earlier point in time, so it cannot be treated as data from a single point in time. If the device holds a file system in this state, the file system may not be mountable, and even a file system check may not be possible.

Outdated data is consistent as of some point in time but is not synchronized with the latest data on the Primary node. This occurs when the replication link goes down, whether temporarily or permanently. Disconnected Outdated data can be used without problems, but it is data from the past. To prevent services from running on such data, bsr by default does not allow a resource with Outdated data to be promoted. If necessary (in an emergency), however, you can force the promotion of Outdated data.

Events

You can check events in real time with the following command. The bsrsetup events2 command can be used with the '--statistics' and '--timestamp' options.

Info

C:\Program Files\bsr\bin>bsrsetup events2 --now r0
exists resource name:r0 role:Secondary suspended:no
exists connection name:r0 peer-node-id:1 conn-name:remote-host connection:Connected role:Secondary
exists device name:r0 volume:0 minor:7 disk:UpToDate
exists device name:r0 volume:1 minor:8 disk:UpToDate
exists peer-device name:r0 peer-node-id:1 conn-name:remote-host volume:0
    replication:Established peer-disk:UpToDate resync-suspended:no
exists peer-device name:r0 peer-node-id:1 conn-name:remote-host volume:1
    replication:Established peer-disk:UpToDate resync-suspended:no
exists -

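Adjusting the synchronization speed

While synchronization runs in the background, the data on the target is temporarily Inconsistent. To preserve consistency, this Inconsistent state should be kept as short as possible, so a sufficiently high synchronization speed is desirable. However, replication and synchronization share the same network bandwidth, and the more bandwidth synchronization is given, the less is left for replication. When replication bandwidth drops, local I/O latency is affected, which in turn degrades local I/O performance on the production server. Whenever either replication or synchronization takes a one-sided share of the bandwidth, the other suffers. bsr therefore implements variable-rate synchronization, which guarantees replication bandwidth as far as possible and adjusts the synchronization bandwidth to the replication load, and uses it as the default policy. In contrast, the fixed-rate synchronization policy always guarantees the synchronization bandwidth regardless of replication. Because it can degrade local I/O performance when used on a server in operation, it is generally not recommended and should be used only in special situations.

Info

Replication and synchronization

Replication applies the changes made by local disk I/O to the target in real time. Because replication runs in the context in which the changed I/O is written to the local disk, it affects local I/O latency.
Synchronization brings the target's data in line with the source disk for the regions of the volume that are out of sync (out-of-sync). It proceeds sequentially from disk sector 0 to the last sector of the volume.

To keep this difference clear, bsr always describes replication and synchronization as separate things.

Info

It makes no sense to set the synchronization rate higher than the maximum disk write speed of the standby node. The standby node is the target of the ongoing device synchronization, so synchronization cannot run faster than the write speed its I/O subsystem allows. For the same reason, it makes no sense to set the synchronization rate higher than the bandwidth available on the replication network.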

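Fixed-rate synchronization

The maximum bandwidth used for background resynchronization is determined by the resource's resync-rate option. This option belongs in the disk section of the resource configuration in /etc/bsr.conf, as follows.

Code Block
resource <resource> {
  disk {
    resync-rate 40M;
    c-min-rate 40M;
    c-plan-ahead 0;
    ...
  }
  ...
}

The resync-rate and c-min-rate settings are specified in bytes per second. The default unit is kibibytes, so a value of 4096 is interpreted as 4 MiB.

Info

Important

If the c-plan-ahead parameter is set to a positive value, the synchronization rate is adjusted dynamically. The default value is 20; for a fixed synchronization rate, set it to 0.
c-min-rate sets the minimum synchronization rate when replication and synchronization run at the same time. The default value is 250k; to guarantee a fixed synchronization rate, set it to the same value as resync-rate.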
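
To make the unit rule concrete, the following Python sketch converts a rate value as written in the configuration into bytes per second. It assumes that a bare number is in KiB and that the K, M, and G suffixes are binary multiples (1024-based); the suffix handling is an assumption of this example, not a statement about the bsr parser, and the function name is hypothetical.

```python
# Multipliers expressed in KiB; a bare number uses the default unit (KiB).
_UNIT_TO_KIB = {"": 1, "K": 1, "M": 1024, "G": 1024 * 1024}


def rate_to_bytes_per_second(value: str) -> int:
    """Convert a rate such as '4096', '250k' or '40M' to bytes per second."""
    text = value.strip()
    suffix = text[-1].upper() if text and text[-1].isalpha() else ""
    number = text[:-1] if suffix else text
    if suffix not in _UNIT_TO_KIB or not number.isdigit():
        raise ValueError(f"unsupported rate value: {value!r}")
    return int(number) * _UNIT_TO_KIB[suffix] * 1024


four_mib = rate_to_bytes_per_second("4096")   # bare number, interpreted as KiB
fixed_rate = rate_to_bytes_per_second("40M")  # value used in the example above
```

With these assumptions, 4096 resolves to 4 MiB per second, matching the rule stated above.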

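Variable-rate synchronization

When multiple resources share the replication/synchronization network, fixed-rate synchronization is not the best choice. Because the resources share the same replication network, if one replication channel takes up the synchronization bandwidth, the other resources can no longer be guaranteed their fixed synchronization rates. In this case, you can configure variable-rate synchronization so that the synchronization rate of each replication channel is adjusted dynamically, which mitigates this kind of starvation. In this mode, bsr determines an initial synchronization rate and then adjusts it continuously through an automatic control loop algorithm. The algorithm ensures enough bandwidth for foreground replication and greatly reduces the impact of background synchronization on foreground I/O.

The optimal configuration for variable-rate synchronization depends on the available network bandwidth, the application I/O pattern, and congestion on the replication link. It can also depend on whether the replication accelerator (DRX) is used.

Info

Estimating the synchronization time

You can estimate the synchronization time with the following formula.

t_resync = D/R

t_resync is the expected synchronization time.
D is the amount of data to be synchronized, on the assumption that it is largely unaffected by outside factors (such as data being modified while the replication link is down).
R is the adjustable synchronization rate, whose upper limit depends on the replication network and the throughput of the I/O subsystem.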
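
As a worked example of t_resync = D/R, the sketch below computes the estimate in seconds. The figures (100 GiB of data, 40 MiB/s) are made up for illustration, and the function name is hypothetical.

```python
def estimate_resync_seconds(data_bytes: int, rate_bytes_per_second: int) -> float:
    """Return t_resync = D / R in seconds."""
    if rate_bytes_per_second <= 0:
        raise ValueError("the synchronization rate must be positive")
    return data_bytes / rate_bytes_per_second


GIB = 1024 ** 3
MIB = 1024 ** 2

# Illustrative figures only: D = 100 GiB to synchronize, R = 40 MiB/s.
seconds = estimate_resync_seconds(100 * GIB, 40 * MIB)
minutes = seconds / 60
```

The result is an upper-level estimate only: R is bounded by both the replication network and the target's I/O subsystem, so the effective rate may be lower than the configured one.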

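Efficient synchronization

For efficient synchronization, bsr provides several features, including checksum-based synchronization, shipping synchronization, and bitmap-clearing synchronization.

Checksum-based synchronization

Using checksum data digests can further improve the efficiency of bsr's synchronization algorithm. Before synchronizing a block, checksum-based synchronization reads it, computes a hash digest of the contents currently on disk, and compares it with the hash digest computed from the same sector on the peer node. If the digests match, the synchronization write (re-write) for that block is skipped; if they do not match, the synchronization data is sent. This can perform better than the conventional approach of simply overwriting every block to be synchronized. If the file system rewrote a sector with identical contents while the connection was down (disconnected), resynchronization of that sector is skipped, which shortens the overall synchronization time.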
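
The idea behind checksum-based synchronization can be illustrated with a minimal Python sketch that compares per-block digests of two in-memory images and returns only the blocks that actually differ. It is a conceptual model, not how bsr implements it: the 4096-byte block size, the SHA-1 digest, and the function name are assumptions chosen for the example, and a real implementation would exchange digests over the network instead of reading both sides locally.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # illustrative block size, not a bsr constant


def blocks_to_resync(
    source: bytes,
    target: bytes,
    candidates: List[int],
    block_size: int = BLOCK_SIZE,
) -> List[int]:
    """Return the candidate block indexes whose digests differ between the images."""

    def digest(volume: bytes, index: int) -> bytes:
        start = index * block_size
        return hashlib.sha1(volume[start:start + block_size]).digest()

    # Blocks with matching digests are skipped; only mismatches are transferred.
    return [i for i in candidates if digest(source, i) != digest(target, i)]


# Three blocks are marked out of sync, but only block 1 really differs.
source_image = bytes(3 * BLOCK_SIZE)
target_image = bytearray(source_image)
target_image[BLOCK_SIZE] = 1
changed = blocks_to_resync(source_image, bytes(target_image), [0, 1, 2])
```

In this example two of the three candidate blocks are skipped, which is the saving the text describes for sectors rewritten with identical contents.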

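Shipping synchronization

Shipping synchronization, in which a disk is physically brought over to set up the peer, is suited to the following situations.

The amount of data to be initially synchronized is very large (hundreds of gigabytes or more)
The expected rate of change of the replicated data is low compared to the enormous data size
The available network bandwidth between the source and target sites is limited

In these situations, initializing with ordinary device synchronization instead of shipping a disk would take a very long time. If the disk is large and can be physically copied to initialize the peer, this method is recommended.

Assume the following situation. The local node is Primary and disconnected. That is, the device is fully configured, and identical copies of bsr.conf exist on both nodes. The commands for the initial resource promotion have been run on the local node, but the remote node is not yet connected.

On the local node, run the following command.
Code Block
bsradm new-current-uuid --clear-bitmap <resource>

Create an exact copy of the data to be replicated and of its metadata. For example, you could use a hot-swappable drive from a RAID-1 mirror. In that case you would of course replace it with a fresh drive so that the RAID set continues mirroring, but the drive you removed is a verbatim copy that can be used elsewhere right away. If the local block device supports snapshot copies, you can use that feature instead.
On the local node, run the following command.
Code Block
bsradm new-current-uuid <resource>

Note that the second invocation has no --clear-bitmap option.

Physically transport the copy, which is identical to the original data, and make it available on the remote node.
You can attach the disk physically, or copy the transported data bit by bit onto a disk that the remote node already has. This must be done for the metadata as well as for the replicated data. If this step cannot be carried out, the procedure cannot continue.
On the remote node, bring up the bsr resource.
Code Block
bsradm up <resource>

Once the two nodes connect, they do not start a full device synchronization. Instead, synchronization starts automatically only for the blocks that have changed since bsradm new-current-uuid --clear-bitmap was called.

Even if nothing has changed, there may be a brief synchronization covering the areas in the Activity Log that are rolled back on the new Secondary node.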

Info

Bitmap-clearing synchronization

The new-current-uuid command generates a new current UUID and rotates all other UUID values. This has at least two use cases: skipping the initial sync, and reducing network bandwidth when starting in a single-node configuration and later (re-)integrating a remote site.

Available option:

--clear-bitmap

Clears the sync bitmap in addition to generating a new current UUID. This can be used to skip the initial sync if you want to start from scratch. This use case only works on "Just Created" meta data. Necessary steps:

On both nodes, initialize meta data and configure the device.
bsradm -- --force create-md res
The nodes need to do the initial handshake so that they know their sizes.
bsradm up res
They are now Connected Secondary/Secondary Inconsistent/Inconsistent. Generate a new current UUID and clear the dirty bitmap.
bsradm new-current-uuid --clear-bitmap res
They are now Connected Secondary/Secondary UpToDate/UpToDate. Make one side Primary and create a file system.
bsradm primary res
mkfs -t fs-type $(bsradm sh-dev res)

One obvious side effect is that the replica is full of old garbage (unless you made the disks identical by other means), so any online verify is expected to find any number of out-of-sync blocks.

You must not use this on pre-existing data! Even though it may appear to work at first glance, once you switch to the other node your data is lost, because it was never replicated. So do not leave out the mkfs (or equivalent).

This can also be used to shorten the initial resync of a cluster where the second node is added after the first node has gone into production, by means of disk shipping. This use case works on disconnected devices only; the device may be in the Primary or Secondary role.

The necessary steps on the current active server are:

bsrsetup new-current-uuid --clear-bitmap minor
Take a copy of the current active server, for example by pulling a disk out of the RAID1 controller or by copying with dd. You need to copy the actual data and the meta data.
bsrsetup new-current-uuid minor

Now add the disk to the new secondary node and join it to the cluster. You will get a resync of the parts that were changed since the first call to bsrsetup in step 1.

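Congestion mode

Info
Used only with asynchronous replication.

In environments where replication bandwidth varies (WAN replication), the replication link can become congested at times. If I/O on the Primary node has to wait as a result, local I/O performance degrades, which is undesirable. You can configure bsr to suspend ongoing replication when it detects such congestion. While replication is suspended, the Primary's data set runs ahead of the Secondary's (Ahead), and the blocks that ran ahead are recorded as out-of-sync (OOS). When the congestion clears, the recorded OOS blocks are resolved by background resynchronization. The following is an example of setting the congestion policy.

In the resource configuration file, set the congestion mode with the on-congestion option and the congestion detection threshold with congestion-fill.

Code Block
resource <resource> {
  net {
    sndbuf-size 20M;
    on-congestion pull-ahead;
    congestion-fill 2G;
    congestion-extents 2000;
    ...
  }
  ...
}

The pull-ahead option is used together with congestion-fill and congestion-extents. The recommended values for congestion-fill are as follows.

When the replication accelerator (DRX) is used, set it to about 90% of the DRX buffer size.
When DRX is not used, set it to 90% of sndbuf-size.
The recommended value for congestion-extents is 90% of the al-extents setting.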
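
The 90% recommendations above are simple arithmetic, sketched here in Python. The function names are hypothetical, and the al-extents value of 6001 is only an illustrative input, not a recommended or default setting.

```python
def recommended_congestion_fill(buffer_bytes: int) -> int:
    """90% of the relevant buffer: the DRX buffer with DRX, otherwise sndbuf-size."""
    return buffer_bytes * 90 // 100


def recommended_congestion_extents(al_extents: int) -> int:
    """90% of the al-extents setting, rounded down."""
    return al_extents * 90 // 100


MIB = 1024 ** 2

# Without DRX and with sndbuf-size 20M, as in the net section above.
fill_bytes = recommended_congestion_fill(20 * MIB)
fill_mib = fill_bytes // MIB

# Illustrative al-extents value only.
extents = recommended_congestion_extents(6001)
```

Integer arithmetic is used so that the result can be written directly into the configuration without rounding surprises.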

Disk flush

If the target node goes down suddenly because of a power failure during replication, data can be lost unless the disk cache is protected by a battery-backed write cache (BBWC). To prevent this, when bsr writes data to the target's disk it always performs a flush after writing the data to the media.

On storage devices equipped with a BBWC, disk flushes are unnecessary, so bsr provides the following options to disable them.

Code Block
resource <resource> {
  disk {
    disk-flushes no;
    md-flushes no;
    ...
  }
  ...
}

Disable device flushes only when running bsr on devices with a battery-backed write cache (BBWC). Most storage controllers automatically disable the write cache and switch to write-through mode when the battery is depleted.

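Integrity verification

Integrity verification checks, based on hash digests, whether the source and target data match exactly. It can run in real time on replication traffic, block by block, or compare an entire disk volume block by block.

Traffic check

bsr can verify the integrity of messages between the two nodes using a cryptographic message digest algorithm. When this feature is enabled, bsr generates a message digest of every data block and sends it to the peer node, and the peer node uses it to verify the integrity of the replication packet. If the digests do not match, the peer requests retransmission.

With this integrity check, bsr can protect the source data against the following error situations during replication. If they are not handled, they can potentially cause data corruption during replication.

Bit errors (bit flips) in data passed between main memory and the network interface of the sending node.
- If the TCP checksum offload feature of recent network cards is enabled, hardware bit flips may go undetected in software.
Bit errors in data transferred from the network interface to main memory on the receiving node (the same consideration applies to TCP checksum offloading).
Corruption caused by bugs or race conditions in the network interface firmware or driver.
Bit flips or random corruption injected by reassembling network components between the nodes (when a direct, back-to-back connection is not used).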

...

failure report of the local block device. The next state is Diskless.
Negotiating. A transient state that occurs when attach is executed on a device that is already connected.
Inconsistent. The data is inconsistent. This is the state of the disks on both nodes immediately after a new resource is configured, and of the target node's disk while it is being synchronized.
Outdated. The resource data is consistent, but it is out of date.
DUnknown. Used for the remote disk's state when no network connection is available.
Consistent. A transient state during node connection in which the data is assumed to be consistent. Once the connection is established, the state is resolved to either UpToDate or Outdated.
UpToDate. The data is consistent and up to date. This is the normal state during replication.

Info

bsr distinguishes between Inconsistent data and Outdated data. Inconsistent data is data that cannot be accessed or used in any way. A typical example is the data on the target side during synchronization: while synchronization is in progress, some of the target's blocks are up to date and others are old, so the data does not correspond to any single point in time. If the device holds a file system in this state, the file system cannot be mounted, and even a file system check cannot be run.

Outdated data is data that is guaranteed to be consistent with the data at a specific point in time, but is not synchronized with the primary node and the latest data. This happens when the replication link goes down, either temporarily or permanently. There is no problem in using the disconnected outdated data, but this is the data from the past. To prevent this data from being serviced, bsr does not allow promotion of resources with outdated data by default. However, if necessary (in an emergency), you can forcibly promote outdated data.

Events

You can check the real-time event status with the following command. The bsrsetup events2 command can be used with the '--statistics', '--timestamp' options.

Info

C:\Program Files\bsr\bin>bsrsetup events2 --now r0
exists resource name:r0 role:Secondary suspended:no
exists connection name:r0 peer-node-id:1 conn-name:remote-host connection:Connected role:Secondary
exists device name:r0 volume:0 minor:7 disk:UpToDate
exists device name:r0 volume:1 minor:8 disk:UpToDate
exists peer-device name:r0 peer-node-id:1 conn-name:remote-host volume:0
replication:Established peer-disk:UpToDate resync-suspended:no
exists peer-device name:r0 peer-node-id:1 conn-name:remote-host volume:1
replication:Established peer-disk:UpToDate resync-suspended:no
exists -
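As a minimal sketch of how this output could be consumed by a script, the following Python splits one events2 line into its action, object type, and key:value fields. The function name parse_events2_line is hypothetical, and the field layout is assumed from the sample output above only, not from a format specification.

```python
def parse_events2_line(line):
    """Split one events2 line into (action, object_type, fields).

    Assumes the layout seen in the sample output: an action word,
    an object type, then whitespace-separated key:value tokens.
    """
    tokens = line.split()
    if len(tokens) < 2:
        return None
    action, obj = tokens[0], tokens[1]
    fields = {}
    for token in tokens[2:]:
        key, sep, value = token.partition(":")
        if sep:  # keep only well-formed key:value tokens
            fields[key] = value
    return action, obj, fields


sample = "exists device name:r0 volume:0 minor:7 disk:UpToDate"
action, obj, fields = parse_events2_line(sample)
print(action, obj, fields["disk"])  # exists device UpToDate
```

A monitoring script could, for example, alert when a device line reports a disk state other than UpToDate.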

Efficient synchronization

bsr provides various functions such as FastSync, checksum-based synchronization, truck-based synchronization, and bitmap clear synchronization for efficient synchronization.

Fast Synchronization

bsr replaces the previous full synchronization, which covers the entire disk area, with FastSync, which synchronizes only the area used by the file system. For example, if only 100MB of a 1TB disk is in use, only that 100MB is synchronized, so initial synchronization transfers roughly one ten-thousandth of the data that the old full synchronization (1TB) would and completes correspondingly faster. FastSync operates at the following times.

Initial full synchronization (bsradm primary --force)
Manual full synchronization (invalidate/invalidate-remote)
Online Verify check (bsradm verify)

Note

Note!

Before initial synchronization, FastSync must obtain the file system's usage information from the disk. If the file system is damaged, the information about the used area may be inaccurate. If FastSync proceeds without detecting this, the source and target will end up inconsistent, so great care is required.

To guard against this, bsr first requests a file system integrity check (chkdsk or fsck) before performing initial synchronization through bsradm primary --force, and enables FastSync only when the check finds no problems.

Before performing bsr initial synchronization, the administrator should run a file system integrity check to determine the health of the disk to be replicated in advance.

Info

To change to the old FullSync method

bsrcon /set_fast_sync 0

When you want to know the current initial synchronization method

bsrcon /get_fast_sync

Checksum-based synchronization

Checksum-based synchronization can further improve the efficiency of bsr's synchronization algorithm. Before synchronizing a block, it reads the block, computes a hash digest of its current on-disk contents, and compares that digest with one computed from the same sector on the other node. If the hashes match, the rewrite of that block is skipped; if they do not match, the synchronization data is transmitted. This can perform better than simply overwriting every block marked for synchronization, and if the file system rewrote a sector with identical contents while the nodes were disconnected, resynchronization of that sector is skipped. Overall, this shortens synchronization time.
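The idea can be illustrated with a short Python sketch. This is not bsr's implementation: the block size, the use of SHA-256 from hashlib, and the function name blocks_to_send are all assumptions chosen for illustration. Both volumes are modeled as in-memory byte strings of equal length.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, not a bsr constant


def digest(block):
    """Hash digest of one block (SHA-256 is a stand-in algorithm)."""
    return hashlib.sha256(block).digest()


def blocks_to_send(source, target, block_size=BLOCK_SIZE):
    """Return offsets of blocks whose digests differ between source and target.

    Blocks with matching digests are skipped, which is the saving
    that checksum-based synchronization aims for.
    """
    offsets = []
    for off in range(0, len(source), block_size):
        if digest(source[off:off + block_size]) != digest(target[off:off + block_size]):
            offsets.append(off)
    return offsets


source = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
target = b"A" * 4096 + b"X" * 4096 + b"C" * 4096
print(blocks_to_send(source, target))  # [4096]
```

Only the middle block differs, so only that block would need to be transmitted; the other two are skipped after comparing digests.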

Truck-based synchronization

Truck-based synchronization by directly importing and configuring disks is suitable for the following situations.

Initially, the amount of data to be synchronized is very large (hundreds of gigabytes or more)
When the rate of change of the data to be copied is expected to be small compared to the huge data size
When available network bandwidth between source and target sites is limited

In these situations, initializing with normal device synchronization instead of truck-based synchronization would take a very long time.

Consider the following scenario. The local node is in the Primary role and is disconnected. That is, the device configuration is complete and identical copies of bsr.conf exist on both nodes. The commands for initial resource promotion have been executed on the local node, but the remote node is not yet connected.

Run the following command on the local node.
Code Block
bsradm new-current-uuid --clear-bitmap <resource>
Create copies of the data to be replicated and the metadata of the data. For example, you could use a hot-swappable drive in the RAID-1 mirror. Of course, in this situation, the RAID set will need to be replaced with a new drive to continue mirroring. However, the disk drive you removed here is a literal copy that can be used elsewhere. If your local block device supports snapshot copy function, you can use it.
Run the following command on the local node. There is no --clear-bitmap option in the second command run.
Code Block
bsradm new-current-uuid <resource>
Physically transport the copy of the original data to the remote site so that it can be used on the remote node.
You can either attach the transported disk directly, or copy its data bit for bit onto an existing disk. This must be done for the metadata as well as for the replicated data. If this procedure is not acceptable, this method cannot be used.
Start the bsr resource on the remote node.
Code Block
bsradm up <resource>

When the two nodes connect, they do not start a full device synchronization. Instead, only the blocks that have changed since the bsradm new-current-uuid --clear-bitmap command was invoked are synchronized automatically.

Even if nothing has changed, a brief synchronization may still occur, depending on the areas covered by the Activity Log that are rolled back on the new secondary node.

Bitmap clear synchronization

You can use the bitmap clear option (--clear-bitmap) to bring nodes into sync quickly without a lengthy initial full synchronization. The following is an example of this operation.

It can be used to skip the initial sync by creating a new Current UUID and clearing the Bitmap UUID. This use case only works for the metadata just created.

On both nodes, initialize the metadata and configure the device.
Code Block
bsradm -- --force create-md res
Start the resource on both nodes so that they recognize each other's volume size during the initial handshake.
Code Block
bsradm up res
When both nodes are connected as Secondary/Secondary, Inconsistent/Inconsistent, create a new UUID and clear the bitmap.
Code Block
bsradm new-current-uuid --clear-bitmap res
Both nodes are now Secondary/Secondary, UpToDate/UpToDate. Promote one side to Primary and create a file system.
Code Block
bsradm primary res
mkfs -t fs-type $(bsradm sh-dev res)

One obvious side effect of this approach is that the replica is full of old garbage unless you make the contents identical by other means, so online verification is expected to find out-of-sync blocks. Never use this method when the volume already holds data. It may appear to work at first, but once you switch over to the other node, the data that was already there has not been replicated, so the data is broken.

Adjusting the synchronization speed

When synchronization is running in the background, the data on the target is temporarily in an inconsistent state. This inconsistent state should be kept as short as possible to ensure consistency, so it is beneficial to have a high enough synchronization rate. However, replication and synchronization share the same network band, and if the synchronization band is set high, replication will be given relatively little bandwidth. If the replication band is lowered, it will affect local I/O latency and result in local I/O performance degradation. If either replication or synchronization unilaterally occupies a lot of bandwidth, it will affect the behavior of the other, so we implement variable-band synchronization, which guarantees the replication band as much as possible while moderating the synchronization band according to the replication situation, and this is the default policy. In contrast, the fixed-band synchronization policy, which guarantees the synchronization band at all times regardless of replication, can cause local I/O performance degradation if used during server operation, so it is not generally recommended and should be used only in special situations.

Info

Replication and synchronization

Replication reflects disk changes (I/O) that occur locally to the target in real time. Replication is performed in the context in which the changed I/O is written to the local disk, so it affects local I/O latency.
Synchronization brings the target-side disk into agreement with the source-side disk by processing the out-of-sync areas of the whole disk volume, sequentially from sector 0 to the last sector of the volume.

To clearly differentiate these differences, bsr always describes replication and synchronization separately.

Info

It makes no sense to set the sync rate to a value higher than the maximum disk write speed of the standby node. Because the standby node is the target of ongoing device synchronization, the sync rate cannot be faster than the write speed of its I/O subsystem allows. For the same reason, it makes no sense to set the sync rate to a value higher than the bandwidth available on the replication network.

Fixed rate synchronization

The maximum bandwidth used for background resynchronization is determined by the resource's resync-rate option. This option is set in the disk section of the resource configuration in /etc/bsr.conf as follows:

Code Block
resource <resource> { disk { resync-rate 40M; c-min-rate 40M; c-plan-ahead 0; ... } ... }

The resync-rate and c-min-rate settings specify a rate in bytes per second. The default unit is KiB, so a value of 4096 is interpreted as 4 MiB per second.
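The unit handling described above can be sketched in Python as follows. The helper rate_to_bytes_per_sec is hypothetical, and treating the K, M, and G suffixes as binary multiples (1024-based) is an assumption made for this sketch; only the "no suffix means KiB" rule comes from the text.

```python
# Assumption: suffixes are binary multiples (1024-based).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def rate_to_bytes_per_sec(value):
    """Convert a rate string such as '40M' or '4096' to bytes per second."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    # No suffix: the default unit is KiB.
    return int(value) * UNITS["K"]


print(rate_to_bytes_per_sec("4096"))  # 4194304 (4 MiB per second)
print(rate_to_bytes_per_sec("40M"))   # 41943040
```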

Info

Important

When the c-plan-ahead parameter is set to a positive value, the synchronization rate is adjusted dynamically. The default value is 20; set it to 0 for fixed-rate synchronization.

Variable rate synchronization

Configuring with fixed-bandwidth synchronization is problematic for configurations where multiple resources share a replication/synchronization network. Because the resources share the same replication network, if the synchronization rate is occupied for a particular replication resource channel, the other resources are not guaranteed a fixed synchronization rate. In this case, variable bandwidth synchronization can be configured to dynamically adjust the synchronization rate of each replication channel to proactively adjust the synchronization band in response to other resources taking over. Variable-band synchronization determines an initial synchronization rate (by resync-rate) and then uses an automatic control algorithm to continuously adjust the synchronization rate. This algorithm ensures that the synchronization band is available from c-min-rate to c-max-rate while still allowing replication to operate on the front end. Setting c-max-rate too high will affect the replication band, so it is preferable to set it to match the network band.

The optimal configuration for variable bandwidth synchronization depends on the available network bandwidth, application I/O patterns, and replication link congestion, and the optimal configuration settings may vary depending on whether you are using Replication Accelerator (DRX).

Info
c-min-rate guarantees a minimum synchronization rate of a specified size, regardless of whether you have a fixed-bandwidth or variable-bandwidth setting.

Info

Difference between BSR and DRBD when handling replication and synchronization at the same time

BSR tries to keep the sync band at a value from c-min-rate to c-max-rate when handling synchronization and replication simultaneously, meaning it tries to free up as much sync band as possible.
DRBD forces the synchronization band to drop to the value of c-min-rate when handling synchronization and replication at the same time.

Info

Synchronization speed estimation

You can estimate the synchronization time with the following formula.

t_resync = D/R

t_resync is the estimated synchronization time.
D is the amount of data to be synchronized, which you can rarely influence (for example, the data modified while the network link was down).
R is the tunable synchronization rate, which has different limits depending on the replication network environment and the processing performance of the I/O subsystem.
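As a worked example of the formula, the following Python sketch computes t_resync for illustrative values. The function name and the figures (100 GiB of out-of-sync data, a 40 MiB/s synchronization rate) are assumptions chosen only to show the arithmetic.

```python
def estimate_resync_seconds(data_bytes, rate_bytes_per_sec):
    """t_resync = D / R, with D in bytes and R in bytes per second."""
    if rate_bytes_per_sec <= 0:
        raise ValueError("rate must be positive")
    return data_bytes / rate_bytes_per_sec


GiB = 1024 ** 3
MiB = 1024 ** 2

seconds = estimate_resync_seconds(100 * GiB, 40 * MiB)
print(seconds)       # 2560.0
print(seconds / 60)  # about 42.7 minutes
```

The estimate is only a lower bound in practice, because R is limited by both the replication network and the I/O subsystem, as noted above.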

Set the synchronization ratio

You can also set the synchronization rate as a ratio relative to the replication bandwidth.

Code Block
resource <resource> { disk { c-min-rate 40M; resync-ratio "3:1"; ... } ... }

The example above sets the synchronization band to a ratio of 3 replication to 1 synchronization (4 total). However, the sync ratio is compared to the c-min-rate and if the c-min-rate is higher, it is applied as the c-min-rate value. This ensures that you have the minimum amount of synchronization bandwidth.
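The interaction between the ratio and the c-min-rate floor can be sketched as follows. This is an illustration under a stated assumption, not bsr's algorithm: it assumes the ratio divides a known total bandwidth, and the function name sync_band and the bandwidth figures are hypothetical.

```python
def sync_band(total_bytes_per_sec, ratio, c_min_rate):
    """Synchronization share of the bandwidth, never below c_min_rate.

    ratio is 'replication:synchronization', for example '3:1'.
    Assumption: the ratio is applied to the given total bandwidth.
    """
    repl_part, sync_part = (int(p) for p in ratio.split(":"))
    share = total_bytes_per_sec * sync_part / (repl_part + sync_part)
    return max(share, c_min_rate)


MiB = 1024 ** 2
print(sync_band(400 * MiB, "3:1", 40 * MiB) / MiB)  # 100.0
print(sync_band(100 * MiB, "3:1", 40 * MiB) / MiB)  # 40.0
```

In the second call the 1/4 share (25 MiB/s) falls below c-min-rate, so the c-min-rate value is applied instead.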

Congestion mode

Info
Used only in asynchronous replication.

In environments where the replication band is a variable band (WAN), the replication link can sometimes become congested. This causes the primary node's I/O to wait, resulting in a performance degradation of local I/O. Congestion mode is a configuration to respond to this situation.

When congestion is detected, replication is suspended and buffered data is slowly sent to the target while logging local I/O to OOS. During this process, the primary is in an Ahead data state compared to the secondary, and once it finishes sending buffered data, it automatically enters sync mode to synchronize the OOS areas that failed to replicate.

Here is an example of setting up a congestion policy.

In the resource configuration file, set the congestion mode with the on-congestion option item and set the congestion detection threshold with the congestion-fill item.

Code Block
resource <resource> { net { sndbuf-size 1G; on-congestion pull-ahead; congestion-fill 900M; congestion-extents 5500; congestion-highwater 20000; ... } ... }

The pull-ahead option is used with congestion-fill, congestion-extents, or congestion-highwater. The recommended values for each option are as follows.

Set congestion-fill to approximately 90% of the size of sndbuf-size. If you are integrating a replication accelerator (DRX), set it to 90% of the DRX buffer. However, if the buffer is allocated a large size, say 10GB or more, the 90% threshold may be too large to detect congestion, so this should be adjusted to a reasonable value through tuning.
The recommended value for congestion-extents is 90% of the al-extents setting.
congestion-highwater detects congestion based on packet count. It is particularly appropriate for use in DR environments where capacity-based detection of replication congestion is not suitable. It is set to 20000 by default and is enabled by default. It is disabled when set to 0 and has a maximum value of 1000000.
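The 90% guidelines above can be turned into a small calculation. The helper below is a hypothetical sketch: its name, the 1 GiB buffer, and the al-extents value of 6000 are illustrative assumptions, and large buffers still need manual tuning as described above.

```python
def recommended_congestion_settings(buffer_bytes, al_extents):
    """Apply the 90% guideline to the buffer size and to al-extents.

    buffer_bytes is sndbuf-size, or the DRX buffer size when DRX is used.
    Integer arithmetic avoids floating-point rounding.
    """
    return {
        "congestion-fill": buffer_bytes * 9 // 10,
        "congestion-extents": al_extents * 9 // 10,
    }


GiB = 1024 ** 3
print(recommended_congestion_settings(1 * GiB, 6000))
# {'congestion-fill': 966367641, 'congestion-extents': 5400}
```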

Info

Transmission buffer (sndbuf) and DRX buffer

It is difficult to allocate a large amount of the transmission buffer (sndbuf) set in bsr because it is allocated directly from kernel memory. This will vary depending on your system, but you will usually need to limit the size to within 1GB. Otherwise, if system kernel memory becomes insufficient due to transmission buffer allocation, system operation and performance may be affected.

Therefore, if you need to configure a large buffer, it is recommended to configure it as a DRX buffer.

Disk flush

If the target node suddenly goes down due to a power failure during replication, data loss may occur if the disk cache is not protected by a battery-backed write cache (BBWC). To prevent this, bsr always performs a flush after data is written to the media on the target's disk.

Storage devices equipped with a BBWC do not need the disk flush operation, so bsr provides options to disable flushing as follows.

Code Block
resource <resource> { disk { disk-flushes no; md-flushes no; ... } ... }

Disable device flushing only when running bsr on devices with a battery-backed write cache (BBWC). Most storage controllers automatically disable the write cache and switch to write-through mode when the battery is depleted.

Consistency verification

Consistency verification either checks replication traffic block by block in real time during replication, or compares the source and target block by block, based on hash digests, to verify that the data matches completely across the whole (used) disk volume.

Traffic integrity check

bsr can use cryptographic message digest algorithms to verify message integrity between the two nodes. With this feature enabled, bsr generates a message digest of every data block and sends it to the peer node, which then verifies the integrity of the replication packet. If the digests do not match, retransmission is requested.

Through this integrity check, bsr can protect the source data during replication against the following error conditions. Left unhandled, these conditions can potentially cause data corruption during replication.

Bit errors (bit flips) that occur in data transferred between main memory and the network interface of the sending node.
- If the TCP checksum offload feature offered by recent network cards is enabled, hardware bit flips may go undetected in software.
Bit errors that occur in data transferred from the network interface to the receiving node's main memory (the same consideration applies to TCP checksum offloading).
Corruption caused by a bug or race condition within the network interface firmware or driver.
Bit flips or random corruption injected by a reassembling network component between the nodes (when a direct, back-to-back connection is not used).

Replication traffic integrity checking is disabled by default. To enable it, add the following to the resource configuration in /etc/bsr.conf.

Code Block
resource <resource> { net { data-integrity-alg <algorithm>; } ... }

<algorithm> is a message digest algorithm supported by the kernel cryptography API in the system's kernel configuration. On Windows, only crc32c is supported.

After making the same change to the resource configuration on both nodes, run bsradm adjust <resource> on both nodes to apply it.

Online Verification

Online verification checks the block-by-block consistency of data between nodes while the service is online. It avoids redundant checks, uses network bandwidth efficiently, and by default checks only the area used by the file system.

Online verification sequentially computes a cryptographic digest of every data block of a resource's storage on one node (the verification source) and sends the digests to the peer node (the verification target), which computes a digest of the block at the same location and compares the two. If the digests do not match, the block is marked out-of-sync and is synchronized later. Network bandwidth is used efficiently because only a small digest is transmitted rather than the entire contents of each block.
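The exchange described above can be sketched in Python. This is an illustration only: SHA-256 stands in for whatever digest algorithm is configured, the volumes are modeled as in-memory byte strings, and the function names block_digests and find_out_of_sync are hypothetical.

```python
import hashlib


def block_digests(data, block_size):
    """Verification source side: one digest per block, in block order."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def find_out_of_sync(source_digests, target_data, block_size):
    """Verification target side: compare received digests with local ones.

    Returns the indices of blocks to mark out-of-sync. Only digests
    cross the network, never the full block contents.
    """
    target_digests = block_digests(target_data, block_size)
    return [idx for idx, (s, t) in enumerate(zip(source_digests, target_digests))
            if s != t]


src = bytes(range(256)) * 64   # 16 KiB of sample data, four 4 KiB blocks
tgt = bytearray(src)
tgt[5000] ^= 0xFF              # corrupt one byte inside block 1
print(find_out_of_sync(block_digests(src, 4096), bytes(tgt), 4096))  # [1]
```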

Since the operation to verify the consistency of the resource is checked during operation, there may be a slight decrease in replication performance when online verification and replication are performed simultaneously. However, there is an advantage that there is no need to stop the service, and there is no downtime of the system during the scan or synchronization process after the scan.

It is common practice to register online verification as a scheduled task in the OS and run it periodically during periods of low operational I/O load.

Enable

Online verification is disabled by default, but can be activated by adding the following entry to the resource configuration in bsr.conf.

Code Block
resource <resource> { net { verify-alg <algorithm>; } ... }

algorithm means the message hashing algorithm, and only supports crc32c in Windows.

To enable online verification, make the same resource configuration changes on both nodes, then run bsradm adjust <resource> on both nodes to apply the changes.

OV run

After enabling online verification, you can run it with the following command:

Info
bsradm verify <resource>

When online verification runs, bsr finds the out-of-sync blocks in <resource>, marks them, and records them. During this time, all applications that use the device can continue to operate without restriction, and the role of the resource can also be changed.

The verify command performs verification after changing the disk state to UpToDate. It is therefore best run on the UpToDate replication source node after initial synchronization has completed. For example, running verify on a node whose disk is Inconsistent changes the disk state to UpToDate, which can cause operational problems.

If out-of-sync blocks are detected during verification, you can synchronize them with the following commands after verification completes. Synchronization runs from the Primary node toward the Secondary and does not proceed in the Secondary/Secondary state. To resolve OOS found by online verification, the source-side node must therefore be promoted to Primary.

Code Block
bsradm disconnect <resource>
bsradm connect <resource>

Automatic verification

If you need to run consistency checks regularly, register the bsradm verify <resource> command with the task scheduler as follows.

First, create a script file with the following contents in a location of your choice on one of the nodes.

Info
bsradm verify <resource>

To verify all resources, use the all keyword instead of <resource>.

The following is an example of creating a scheduled task with schtasks (the Windows task scheduling command). With these settings, online verification runs every Sunday at 00:42.

Code Block
schtasks /create /tn "drbd_verify" /tr "%wdrbd_path%\verify.bat" /sc WEEKLY /D sun /st 00:42

Persist Role

While resource roles can be changed based on operational circumstances, sometimes you may want to persist roles. (BSR 1.7.3 and later)
A resource with persist-role set will continue to have the resource role explicitly specified (with the bsradm command) at the time of restart. This works in any situation where the replication service or system reboots, causing the resource to restart.

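Code Block
resource <resource> { net { data-integrity-alg <algorithm>; } ... }

<algorithm> is a message digest algorithm supported by the kernel cryptography API in the system's kernel configuration. On Windows, only crc32c is supported.

After making the same change to the resource configuration on both nodes, run bsradm adjust <resource> on both nodes to apply it.

Online verification

Online verification checks the block-by-block consistency of data between nodes while the device is in operation. It uses network bandwidth efficiently and does not perform redundant checks.

Online verification sequentially computes a cryptographic digest of every data block of a resource's storage on one node (the verification source) and sends the digests to the peer node (the verification target), which computes a digest of the block at the same location and compares the two. If the digests do not match, the block is marked out-of-sync and becomes a target for later synchronization. Network bandwidth is used efficiently because only a minimal digest is transmitted rather than the entire contents of each block.

Because the consistency of the resource is verified during operation, replication performance may drop slightly when online verification and replication run at the same time. The advantage is that the service does not need to be stopped, and no system downtime occurs during the check or during the synchronization that follows it.

Online verification is usually registered as a scheduled task in the OS and run periodically during periods of low operational I/O load.

Enable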

...

    persist-role yes;
  }
  ...
}

One-way replication

If you always want replication to run one way only, from the primary node to the standby node, without switchover or failover, consider setting the target-only attribute on the standby node. (BSR 1.7.3 and later)

Set the persist-role attribute described above in the resource options section to fix the roles of the primary and standby nodes.
Set the target-only attribute on the standby node side to force the replication/synchronization direction from the primary node to the standby node only.

A target-only node is prohibited from acting as a source in all replication/sync operations, including explicit commands, and can only have a target role; any manual synchronization or promotion commands that act as a source are blocked (but promotion is allowed on disconnection).

Code Block
resource <resource> { net { verify-alg <algorithm>; } ... }

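algorithm is the message hashing algorithm. On Windows, only crc32c is supported.

To enable online verification, make the same change to the resource configuration on both nodes, then run bsradm adjust <resource> on both nodes to apply it.

Running online verification

After enabling online verification, you can run it with the following command.

Info
bsradm verify <resource>

When online verification runs, bsr finds the out-of-sync blocks in <resource>, marks them, and records them. During this time, all applications that use the device can continue to operate without restriction, and the role of the resource can also be changed.

The verify command performs verification after changing the disk state to UpToDate. It is therefore best run on the UpToDate replication source node after initial synchronization has completed. For example, running verify on a node whose disk is Inconsistent changes the disk state to UpToDate, which can cause operational problems, so take care.

If out-of-sync blocks are detected during verification, you can synchronize them with the following commands after verification completes. Synchronization runs from the Primary node toward the Secondary and does not proceed in the Secondary/Secondary state. To resolve OOS found by online verification, the source-side node must therefore be promoted to Primary.

Code Block
bsradm disconnect <resource>
bsradm connect <resource>

Automatic verification

If you need to run consistency checks regularly, register the bsradm verify <resource> command with the task scheduler as follows.

First, create a script file with the following contents in a location of your choice on one of the nodes.

Info
bsradm verify <resource>

To verify all resources, use the all keyword instead of <resource>.

The following is an example of creating a scheduled task with schtasks (the Windows task scheduling command). With these settings, online verification runs every Sunday at 00:42.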

...

 
  on active {
    ...
  }
  
  on standby-DR {
    ...
    options {
      target-only yes;
      ...
    }
  }
  ...
}

Info
Verifying data on a target-only node: after disconnecting replication, you can verify the data by promoting the node. Promoting to verify data causes a split-brain (SB), so to return to replication, demote the node again and handle it as a split-brain resolution.

