1 Throughput
- 1.1 Hardware
  - 1.1.1 I/O subsystem
  - 1.1.2 Network
- 1.2 Estimation of throughput
- 1.3 Throughput optimization tuning
  - 1.3.1 req-buf-size
  - 1.3.2 max-buffers
  - 1.3.3 al-extents
  - 1.3.4 sndbuf-size
2 Latency
- 2.1 Hardware
  - 2.1.1 I/O subsystem
  - 2.1.2 Network
- 2.2 Estimation of latency
- 2.3 Latency optimization tuning
  - 2.3.1 Disable TCP Delayed ACK
  - 2.3.2 accelbuf-size

Throughput

Let's look at some hardware considerations and tuning topics for optimizing throughput for bsr.

Hardware

The throughput of bsr is affected by the underlying I/O subsystem (disk, controller, and cache) and by the network bandwidth available to the system.

I/O subsystem

The throughput of the I/O subsystem is largely determined by the number of disks configured in parallel. A modern SCSI or SAS disk can generally sustain streaming writes of about 40 MB/s. In a striped configuration, I/O is distributed across the disks and written in parallel, so throughput scales roughly with the number of disks. For example, you can expect about 120 MB/s from three 40 MB/s disks in a RAID-0 or RAID-1+0 configuration, and about 200 MB/s from five disks.

In general, mirroring (RAID-1) has little effect on throughput. RAID-5 improves throughput, but is somewhat slower than striping.
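
The striping arithmetic can be sketched in a few lines of Python. This is only an illustration of the linear scaling described here; the function name `striped_throughput` is hypothetical, and the 40 MB/s per-disk figure is the rough value quoted in this section, not a measurement.

```python
def striped_throughput(disks: int, per_disk_mb_s: float = 40.0) -> float:
    """Idealized streaming write throughput (MB/s) of a striped set of disks.

    Assumes perfectly linear scaling, which real controllers only approximate.
    """
    if disks < 1:
        raise ValueError("at least one disk is required")
    return disks * per_disk_mb_s


print(striped_throughput(3))  # 120.0
print(striped_throughput(5))  # 200.0
```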

Network

Network throughput is usually determined by the amount of traffic on the network and by the network infrastructure (routing and switching equipment). However, replication links for bsr generally run over dedicated lines, so these environmental conditions have little effect on them. Network throughput can therefore be improved by moving to a higher-bandwidth link (such as 10 Gigabit Ethernet) or by combining multiple links, for example with a bonding network driver.

Estimation of throughput

Throughput is estimated from the two factors described above: the I/O subsystem and the network bandwidth.

Theoretically, the maximum throughput of bsr is the smaller of the two maximum values, reduced by an assumed overhead of about 3%.

Consider a cluster node that has an I/O subsystem with 200 MB/s throughput and is connected by Gigabit Ethernet. Assuming that a TCP connection over Gigabit Ethernet provides 110 MB/s, the network is the bottleneck, and after subtracting 3% you can expect a maximum throughput of approximately 107 MB/s.
If the I/O subsystem instead provides only 100 MB/s, it becomes the bottleneck, and the maximum throughput of bsr can be expected to be about 97 MB/s.
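
As a rough sketch, the estimate can be written as a small Python function: take the smaller of the two throughput figures and subtract the assumed 3% overhead. The function name `estimate_max_throughput` is hypothetical, and the numbers are the example values used in this section.

```python
def estimate_max_throughput(io_mb_s: float, net_mb_s: float,
                            overhead: float = 0.03) -> float:
    """Estimate the bsr throughput ceiling in MB/s.

    The slower of the I/O subsystem and the network is the bottleneck;
    the result is reduced by the assumed overhead (about 3%).
    """
    bottleneck = min(io_mb_s, net_mb_s)
    return bottleneck * (1.0 - overhead)


# Network-bound case: 200 MB/s storage behind a 110 MB/s TCP link.
print(round(estimate_max_throughput(200, 110)))  # 107
# I/O-bound case: 100 MB/s storage behind the same link.
print(round(estimate_max_throughput(100, 110)))  # 97
```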

Throughput optimization tuning

The following are recommendations for some of the configuration options that can affect bsr performance. Because performance depends heavily on hardware, the effect of adjusting these options varies from system to system, and the recommendations are not guaranteed to resolve every performance bottleneck.

req-buf-size

Specifies the maximum size of the primary node's I/O request buffer queue (req-buf). When I/O is issued to the local disk, the request is first queued in this buffer before it is passed to the TCP send buffer.

When replication is proceeding normally, req-buf is unlikely to fill up. If the system load increases for some reason, however, replication processing may be delayed and req-buf may become full. (Under such delay and load, ACK delays may even cause the replication connection to be dropped.) To prepare for this situation, set a larger req-buf size. The default is 100 MB per resource.

max-buffers

max-buffers is the maximum number of peer request buffers allocated for writing data to the target's disk; it affects the write performance of the secondary node. The default is 16000, and a value of around 8000 works well for most hardware RAID controllers.

The max-buffers value must be equal to or greater than the max-epoch-size value.

resource <resource> {
  net {
    max-buffers 8000;
    max-epoch-size 8000;
    ...
  }
  ...
}
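
The constraint between the two options can be checked before a configuration is rolled out. The following Python sketch is only an illustration; `check_net_options` is a hypothetical helper and not part of bsr, and it assumes the two values have already been read from the configuration.

```python
def check_net_options(max_buffers: int, max_epoch_size: int) -> None:
    """Raise ValueError if max-buffers is smaller than max-epoch-size."""
    if max_buffers < max_epoch_size:
        raise ValueError(
            f"max-buffers ({max_buffers}) must be >= "
            f"max-epoch-size ({max_epoch_size})"
        )


# The values from the example configuration satisfy the constraint.
check_net_options(max_buffers=8000, max_epoch_size=8000)
```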

al-extents

If the application using bsr is write-intensive and frequently issues small writes, it is recommended to increase the size of the activity log (AL). Otherwise, metadata updates occur frequently and write performance may suffer. The following example expands the AL size to 65001 extents. The stripe size of 288 kB passed to create-md is reported by dump-md in 4 KB units, that is, as 72.

>bsradm create-md --al-stripe-size-kB 288 r0
...
>bsradm dump-md r0
...
al-stripe-size-4k 72;
...

Specify al-extents and congestion-extents.

resource <resource> {
  disk {
    al-extents 65001;
    ...
  }
  net {
    congestion-extents     58501;  # 90% of the al-extents
  }
}
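
The two derived numbers in this example can be reproduced with simple arithmetic. The Python sketch below is illustrative only: the helper names are hypothetical, and the 90% ratio is taken from the comment in the configuration above.

```python
def congestion_extents(al_extents: int, percent: int = 90) -> int:
    """Return the given percentage of al-extents, rounded to the nearest integer."""
    return (al_extents * percent + 50) // 100


def stripe_size_in_4k_units(stripe_size_kb: int) -> int:
    """Convert an AL stripe size given in kB to 4 KB units."""
    if stripe_size_kb % 4 != 0:
        raise ValueError("stripe size must be a multiple of 4 kB")
    return stripe_size_kb // 4


print(congestion_extents(65001))     # 58501
print(stripe_size_in_4k_units(288))  # 72
```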

sndbuf-size

The send buffer optimizes transmission through buffering. Because TCP/IP is an ACK-based protocol, protocol overhead grows with the number of segments transmitted. From a network perspective, it is therefore advantageous to collect as much data as possible before sending. The send buffer does exactly this, so the optimization comes naturally. In particular, when small amounts of data are transmitted continuously, throughput can be improved greatly by accumulating the data in the send buffer queue and sending it in larger batches. Consider tuning sndbuf-size in the following situations:

- The replication target consists of many small files, so replication I/O itself occurs in small units of 4 KB to 8 KB.
- Online Verify is run in small verification units.

Latency

This section covers latency optimization for bsr. To minimize latency, we first review the hardware and then look at some configuration options.

Hardware

The latency of bsr is affected by the latency of the underlying I/O subsystem (disk, controller, and cache) and by the latency of the network.

I/O subsystem

The latency of the I/O subsystem is mainly determined by the rotational speed (RPM) of the disks. Using disks with a higher RPM is therefore the usual way to improve it.

Similarly, a battery-backed write cache (BBWC) can reduce write latency. Most recent storage devices are equipped with such a cache, and the administrator can set the ratio of read cache to write cache. To improve write I/O performance, it is recommended to disable the read cache and use the entire cache for writes.

Network

Network latency essentially means the packet round-trip time (RTT) between hosts. Several factors can affect it, but a dedicated back-to-back link is rarely subject to them, so it is highly recommended to use such a link for replication. With Gigabit Ethernet, packet RTT is typically on the order of 100 to 200 microseconds (μs).

Estimation of latency

As with throughput, there are some important constraints to consider when estimating latency.

The latency of bsr is bounded by the raw I/O subsystem latency and the available network latency.

The sum of these two values is, in theory, the minimum latency that bsr incurs, plus an overhead of a little less than 1%. For example, if the local disk has a latency of 3 ms and the network has a latency of 0.2 ms, the expected bsr latency is about 3.2 ms, which is roughly 7% higher than writing to the local disk alone. Latency is also affected by several other factors, such as CPU cache misses and context switching.
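
The example can be worked through in Python as a rough sketch. The function name `estimate_latency` is hypothetical, and the additional overhead of a little less than 1% is left out because no exact figure is given for it.

```python
def estimate_latency(disk_ms: float, net_ms: float) -> tuple[float, float]:
    """Return (expected latency in ms, increase over local disk in percent).

    The expected latency is the sum of the disk and network latencies;
    the small extra overhead mentioned in the text is not modeled.
    """
    total_ms = disk_ms + net_ms
    increase_pct = net_ms / disk_ms * 100.0
    return total_ms, increase_pct


total, pct = estimate_latency(disk_ms=3.0, net_ms=0.2)
print(f"{total:.1f} ms, about {pct:.0f}% above local disk latency")
# 3.2 ms, about 7% above local disk latency
```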

Latency optimization tuning

Additional features provided by the network protocol or transport layer are candidates for latency tuning.

These features are provided by third-party vendors, OS components, and so on to increase overall network performance, but they may have a negative impact on latency. Examples include the Delayed ACK timer provided by TCP, and the MTU size, jumbo packets, and Large Send Offload (LSO) provided at the NIC layer. All of them sit, directly or indirectly, on the path along which replication data is sent or received.

If replication performance is unexpectedly low and the cause cannot be found otherwise, you may need to disable these features.

Disable TCP Delayed ACK

In particular, Delayed ACK, which is enabled by default in TCP on Windows, helps to optimize bandwidth usage but has an adverse effect on latency, so it should be disabled. When bsr is first installed, the installation script disables it automatically, but if the network adapter is replaced later, it must be disabled again manually.

The following is a command to disable Delayed ACK for a network interface. Specify the interface by IP address or GUID.

This command is for Windows only. It sets TCP option values in the Windows registry and applies the contents to the network interface. You can check the registry contents in the path below to see if this option has been applied properly.

HKLM\SYSTEM\CurrentControlSet\services\Tcpip\Parameters\Interfaces\{Interface GUID}\

TcpAckFrequency(REG_DWORD) = 1

accelbuf-size

To complete local I/O quickly, bsr implements double buffering between local I/O and the send buffer. This additional buffer is called the acceleration buffer.

The acceleration buffer was implemented to reduce the latency of local I/O, and it is especially effective for small I/O of 4 KB or less.
